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FOREWORD 


The  material  presented  in  this  report  constitutes  a  portion  of  the  author’s  doctoral 
thesis  from  Texas  A&M  University,  entitled  “Segmented  Chirp  Features  and  Hidden 
Gauss-Markov  Models  for  Transient  Signal  Classification.”  In  addition  to  the  material 
presented  here,  the  thesis  discusses  the  feature  set  used  to  characterize  segments  of 
wandering-tone  transients,  and  it  contains  a  lengthy  chapter  on  simulation  exper¬ 
imental  results  for  classifying  these  signal  types.  Since  the  theoretical  developments 
relating  to  continuous-state  and  Gaussian  hidden  Markov  models  should  be  of  interest 
to  a  much  wider  audience,  it  seems  desirable  to  have  a  self-contained  discussion  of  these 
topics,  without  the  “baggage”  of  that  additional  material.  This  report  provides  such  a 
discussion. 

Selected  portions  of  this  material  have  also  been  submitted  to  the  IEEE  Transac¬ 
tions  on  Signal  Processing  in  a  proposed  article  entitled  “Hidden  Gauss-Markov  Models 
for  Signal  Classification,”  by  P.  L.  Ainsleigh,  N.  Kehtamavaz,  and  R.  L.  Streit.  The 
space  constraints  of  such  a  journal  preclude  the  full  development  given  here,  however. 
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THEORY  OF  CONTINUOUS-STATE  HIDDEN  MARKOV  MODELS 
AND  HIDDEN  GAUSS-MARKOV  MODELS 

1.  INTRODUCTION 

Two  of  the  fundamental  problems  that  have  driven  the  development  of  signal 
processing  algorithms  during  the  past  several  decades  are  tracking ,  whose  goal  is  to 
characterize  the  time  evolution  of  a  “source”  through  space,  frequency,  or  some  other 
physical  variable,  and  classification,  whose  aim  is  to  make  a  decision  about  the  nature 
of  a  source  from  its  observable  characteristics.  Tracking  can  play  a  significant  role  in 
classification  since  much  of  the  information  about  the  nature  of  a  source  can  be  inferred 
from  the  way  that  it  “moves”  through  the  domain  of  a  physical  variable.  In  spite  of 
this,  the  two  fields  have  historically  evolved  independently,  giving  rise  to  two  separate 
families  of  tools.  From  the  tracking  side  comes  the  Kalman  filter  and  a  number  of 
related  smoothing  algorithms.  From  the  classification  side  comes  the  hidden  Markov 
model  (HMM)  and  its  associated  algorithms.  This  report  unifies  these  two  bodies  of 
theory. 

The  idea  of  an  equivalence  between  Kalman  filters  and  HMMs  will  come  as  no 
surprise  to  many  readers,  for  the  analogous  nature  of  these  two  areas  has  been  noted  in 
the  literature  for  many  years.  The  specifics  of  the  equivalence  may  be  more  surprising, 
however.  The  Kalman  filter  is  not  just  analogous  to  an  HMM.  The  Kalman-filter  model 
is  an  HMM  with  linear  Gaussian  model  densities.  The  Baum  algorithm  for  this  HMM 
is  a  Kalman  smoother,  as  is  the  Viterbi  algorithm.  The  likelihood  defined  by  the 
HMM  criteria  is  analytically  the  same  as  the  likelihood  defined  for  a  Kalman  filter. 
Each  iteration  of  the  expectation-maximization  (EM)  parameter-estimation  algorithm 
for  Kalman-filter  models  maximizes  an  auxiliary  function  whose  structure  is  identical 
to  the  function  whose  maximization  gives  the  Baum- Welch  re-estimation  formula  for 
HMMs.  The  equivalences  are  demonstrated  here  in  great  detail,  thus  removing  any 
ambiguity  about  the  relationship  between  these  two  tools. 

1.1  BACKGROUND 

The  material  presented  in  this  report  resulted  from  an  investigation  of  methods  for 
classifying  certain  types  of  transient  signals.  In  particular,  this  work  is  motivated  by  a 
persistent  difficulty  that  occurs  in  the  design  of  statistical  signal  classifiers.  The  source 
of  this  difficulty  is  the  high  dimensionality  of  feature  sets  required  to  adequately  describe 
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signals  from  each  class,  which  often  results  in  difficult  or  impossible  probability-density 
estimation  problems  [1].  An  approach  for  dealing  with  this  issue  is  to  segment  signals  in 
time,  develop  a  low-dimensional  feature  set  that  provides  good  signal  approximations 
on  the  shorter  time  segments,  and  use  a  parameterized  stochastic  model  to  evaluate 
the  time  evolution  of  the  feature  values  relative  to  each  class.  The  objective  of  such  an 
approach  is  to  replace  a  high-dimensional  time-invariant  probability  distribution  with  a 
low-dimensional  time-varying  distribution,  thereby  exchanging  the  difficult  problem  of 
estimating  densities  in  high-dimensional  feature  spaces  for  an  easier  model  parameter 
estimation  problem. 

In  addition  to  alleviating  the  dimensionality  problem,  this  type  of  stochastic  fea¬ 
ture  modeling  helps  to  accommodate  the  within-class  variability  that  is  commonly  en¬ 
countered  in  real-world  signal  classes.  It  is  able  to  do  this  because  the  stochastic  model 
allows  for  “controlled  uncertainty”  when  characterizing  the  feature  evolution,  effectively 
forming  an  entire  family  of  template-like  models  that  account  for  the  different  variations. 

Stochastic  feature  modeling  is  considered  here  in  the  context  of  Tnfl.Tnnrmm-1iVpli>moH 
classification,  which  is  performed  in  a  conventional  or  a  class-specific  framework  [2].  In 
the  conventional  mode,  classes  are  distinguished  whose  signal  segments  are  all  par¬ 
simoniously  represented  by  the  same  set  of  features.  The  features  are  tracked  nsing 
a  parameterized  model  M(@).  Sequences  of  feature  values  from  different  classes  are 
represented  using  different  values  for  the  parameter  set  ©,  which  are  estimated  by  max¬ 
imizing  the  likelihood  of  a  set  of  labeled  sequences  (i.e.,  a  training  set)  from  each  class 
After  ©  is  estimated  for  each  class,  an  unlabeled  feature  sequence  Z  is  classified  by 
assigning  it  to  the  class  for  which  the  class-conditional  likelihood  £{Z|0}  is  maYimnm 
In  the  class-specific  framework,  distinct  feature  sets  are  used  to  represent  signals  from 
different  classes.  While  the  structure  of  the  model  M (©)  for  different  classes  may  be 
different  in  this  framework,  the  parameters  in  any  given  class  model  are  still  estimated 
by  maximizing  the  likelihood  of  a  set  of  training  features  from  that  class.  Further¬ 
more,  the  class-conditional  likelihood  £{Z\M(@)}  is  still  used  as  a  measure  of  nW 
membership,  although  a  “change-of-measure”  operation  may  be  required  before  it  can 
be  compared  to  class-membership  measures  for  other  classes.  In  both  the  conventional 
and  class-specific  scenarios,  then,  the  fundamental  problems  in  stochastic  modeling  for 
classification  are  model  parameter  estimation  and  likelihood  evaluation. 

One  possible  choice  for  ,M(©)  is  the  linear  Gaussian  state-space  model,  or  Kalman- 
filter  model,  which  is  general  enough  to  accommodate  a  wide  range  of  classes  and  fea¬ 
tures.  This  model  is  supported  by  an  extensive  body  of  theory  that  has  evolved  out 
of  the  tracking,  control,  and  optimal  filtering  arenas  [3-6],  but  it  has  not  traditionally 
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been  used  as  a  classification  tool.  Another  potential  choice  for  M(@)  is  the  family  of 
HMMs,  which  have  been  used  very  successfully  to  classify  sequences  of  Fourier-  and 
cepstrum-based  features  in  speech  recognition  [7,8].  Traditional  HMMs  and  the  pop¬ 
ular  Baum- Welch  training  algorithm  [9],  however,  constrain  the  state  vectors  to  reside 
in  a  discrete  state  space,  making  them  unsuitable  for  signals  whose  features  vary  con¬ 
tinuously  as  a  function  of  time.  A  component  of  the  present  work  is  the  development 
of  continuous-state  HMMs  (GS-HMMs)  that  are  better  able  to  handle  continuously 
varying  feature  sequences.  The  CS-HMM  algorithms  are  then  specialized  to  the  linear 
Gaussian  models,  which  are  referred  to  in  this  context  as  hidden  Gauss-Markov  models 
(HGMMs). 

While  the  focus  in  this  report  is  on  classification,  the  theory  may  also  be  of  interest 
in  tracking  applications  due  to  the  prominence  of  the  Kalman  filter  as  a  tracking  tool. 
A  common  criticism  of  the  Kalman  filter  is  that  it  gives  too  much  credence  to  the  model 
and  not  enough  to  the  observed  data.  A  symptom  of  this  “narcissistic-model  syndrome” 
is  that  the  error  covariances  depend  only  on  the  a  priori  values  of  the  model  parameters 
and  are  independent  of  the  observed  data.  An  improved  tracker  might  be  obtained  by 
including  a  parameter-updating  scheme  during  the  processing  of  observed  data,  as  is 
done  during  the  training  phase  for  an  HGMM.  Furthermore,  the  general  development 
of  CS-HMMs  could  provide  a  framework  for  investigating  tracking  algorithms  based  on 
non-Gaussian  models. 

1.2  RELATED  LITERATURE 

Discrete-state  HMMs  (DS-HMMs)  were  developed  by  Baum  and  his  colleagues  in 
the  late  1960s  and  early  1970s  [9-12].  Baum’s  work  includes  a  method  for  estimating  the 
sequence  of  individually  most  likely  hidden  states  (referred  to  as  the  Baum  or  forward- 
backward  algorithm)  and  a  method  for  estimating  the  parameters  in  the  HMM  (referred 
to  as  the  Baum- Welch  algorithm).  An  alternative  approach  to  state  estimation  is  to  seek 
the  single  most  likely  sequence  of  states,  which  is  obtained  using  the  Viterbi  algorithm 
[13, 14].  Dempster  et  al.  [15]  observed  that  the  Baum- Welch  algorithm  for  estimating 
the  HMM  parameters  is  an  example  of  the  EM  algorithm.  Heiser  has  since  noted  that 
the  EM  algorithm  is  a  special  case  of  iterative  majorization  [16]. 

In  1980,  Ferguson  [17]  helped  to  solidify  HMM  theory  for  speech  recognition  by 
outlining  the  three  fundamental  problems,  namely,  state  estimation,  likelihood  evalu¬ 
ation,  and  parameter  estimation.  This  “HMM  paradigm”  for  speech  recognition  was 
developed  further  by  Levinson  et  al.  [18]. 
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Most  extensions  of  the  basic  HMM  structure  have  focused  on  obtaining  more 
general  output  densities.  Liporace  [19]  treated  HMMs  with  elliptically  symmetric  con¬ 
tinuous  output  densities.  Juang  et  al.  [20]  relaxed  the  elliptical  symmetry  requirement 
by  treating  models  with  Gaussian-mixture  output  densities.  Other  continuous-output 
HMMs  include  those  whose  measurements  axe  governed  by  an  autoregressive  process 
[21-23],  a  polynomial  regression  function  [24],  and  a  linear  Gaussian  model  [25].  When 
employed  with  models  having  variable-duration  states  [26],  such  continuous-output  dis¬ 
tributions  lead  to  the  versatile  class  of  segmental  models  [27]. 

In  contrast  to  these  continuous-output  models,  the  general  class  of  CS-HMMs  has 
received  very  little  attention  in  the  signal  processing  literature.  Linear  Gaussian  models, 
on  the  other  hand,  have  been  extensively  studied  in  the  Kalman-filter  context,  although 
it  is  not  generally  known  that  these  models  are  examples  of  CS-HMMs.  An  excellent 
review  of  the  early  history  of  Gaussian  models  in  a  linear-filtering  context  was  given  by 
Kailath  [28].  Methods  for  estimating  parameters  in  these  models  are  reviewed  below. 

Most  early  applications  of  Kalman-filter  models  in  the  engineering  literature  con¬ 
sidered  the  model  parameters  (i.e.,  the  transition,  output,  and  covariance  matrices)  to 
be  known  from  physical  considerations,  so  that  parameter  estimation  was  not  an  issue. 
Two  notable  exceptions  are  found  in  the  works  by  Kashyap  [29]  and  Gupta  and  Mehra 
[30],  who  used  gradient-based  nonlinear  optimization  techniques  to  maximize  the  like¬ 
lihood  as  expressed  in  terms  of  the  measurement  innovations.  Parameter  estimation 
in  Gaussian  models  is  more  prominent  in  the  time  series  and  econometrics  literature, 
where  the  first  applications  of  the  EM  algorithm  for  this  purpose  appeared.  In  the  early 
1980s,  Shumway  and  Stoffer  [31]  and  Watson  and  Engle  [32]  independently  derived  EM 
algorithms  for  estimating  parameters  in  time-invariant  Gaussian  models  with  linear  con¬ 
straints.  While  the  general  thrusts  of  these  references  are  the  same,  they  have  a  number 
of  distinguishing  features.  First,  while  Shumway  and  Stoffer  assume  that  the  output 
matrix  is  known  a  priori,  Watson  and  Engle  estimate  the  output  matrix  along  with  the 
other  model  parameters.  Second,  Shumway  and  Stoffer  provide  an  explicit  recursive 
expression  for  the  cross  covariance  between  time-adjacent  states,  which  must  be  calcu¬ 
lated  as  part  of  the  expectation  portion  (or  E-step)  of  the  EM  algorithm.  Watson  and 
Engle,  on  the  other  hand,  obtain  the  cross  covariance  implicitly  by  using  an  augmented 
state  vector  during  the  Kalman  smoothing  algorithm  that  lies  at  the  heart  of  the  E-step. 
Finally,  Shumway  and  Stoffer  derive  the  EM  algorithm  by  defining  and  maximizing  the 
auxiliary  function  explicitly,  while  Watson  and  Engle  formulate  the  algorithm  in  terms 
of  sufficient  statistics. 
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Building  on  this  earlier  work,  signal  processing  researchers  began  using  the  EM 
algorithm  with  linear  Gaussian  models  in  the  early  1990s.  Ziskind  and  Hertz  [33]  used 
this  approach  to  estimate  directions  of  arrival  for  narrowband  autoregressive  processes 
received  on  a  multisensor  array.  Weinstein  et  al.  [34]  estimated  the  parameters  in  a 
linear  Gaussian  model  while  performing  noise  removal  in  signals  received  on  a  pair 
of  sensors  with  known  coupling.  Digalakis  et  al.  [25]  extended  the  EM  algorithm  to 
treat  time-varying  models  and  used  the  linear  Gaussian  model  to  represent  segments 
of  speech.  Deng  and  Shen  [35]  later  provided  a  decomposition  algorithm  to  speed  the 
processing  of  the  EM  algorithm  for  high-dimensional  state  spaces.  Finally,  Elliot  and 
Krishnamurthy  [36]  derived  an  efficient  filter-based  implementation  of  the  correlation 
matrices  required  during  the  E-step  Of  the  EM  algorithm.  This  filter-based  approach 
dispenses  with  the  need  for  the  more  computationally  demanding  Kalman  smoother, 
which  could  make  the  EM  algorithm  much  more  amenable  to  real-time  processing. 

An  extension  of  the  linear  Gaussian  model  that  is  particularly  relevant  to  the 
current  work  was  provided  by  Sorenson  and  Alspach  [37]  and  Lo  [38],  who  derived 
Kalman-filter  recursions  for  models  with  a  Gaussian-mixture  prior  for  the  initial  state, 
as  opposed  to  the  single  Gaussian  density  that  traditionally  governs  the  initial  state. 

Finally,  the  unification  of  HMMs  and  Kalman  filters  was  initiated  a  decade  ago 
by  Streit  [39],  who  demonstrated  that  the  Baum  and  Viterbi  algorithms  for  a  CS-HMM 
with  Gaussian  model  densities  generate  the  same  state  vectors  and  covariance  matrices 
as  are  obtained  using  a  fixed-interval  Kalman  smoother.  Streit  was  considering  these 
algorithms  in  the  context  of  the  tracking  and  not  the  classification  problem,  however, 
so  that  likelihood  evaluation  and  model  parameter  estimation  were  not  at  issue.  This 
limited  focus  allowed  Streit  to  ignore  both  the  scaling  constants  on  the  Baum  probability 
densities  and  the  joint  densities  of  adjacent  states  in  his  characterization  of  the  optimal 
state  sequences. 


1.3  ORIGINAL  CONTRIBUTIONS 

This  report  presents  a  general  theory  for  CS-HMMs,  independent  of  the  particular 
form  of  the  model  densities,  addressing  the  state  estimation,  likelihood  evaluation,  and 
parameter  estimation  problems  as  outlined  for  DS-HMMs  by  Ferguson  [17].  While 
continuous-state  versions  of  the  Baum  and  Viterbi  algorithms  are,  for  the  most  part, 
straightforward  extensions  of  their  discrete-state  counterparts  and  are  obtained  using 
methods  outlined  by  Jazwinski  [40],  the  Baum  and  Viterbi  algorithms  treat  only  state 
estimation  and  likelihood  evaluation.  The  present  research  also  addresses  parameter 
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estimation,  to  the  extent  possible,  by  giving  a  general  formulation  of  the  EM  .auxiliary 
function.  This  formulation  solves  the  E-step  of  the  EM  algorithm.  The  M-step  can 
then  be  defined  by  maximizing  the  auxiliary  function  after  the  form  is  specified  for  the 
model  densities. 

The  CS-HMM  results  are  then  specialized  to  HGMMs.  With  regard  to  the  Baum 
and  Viterbi  algorithms  for  HGMMs,  this  research  extends  the  results  given  by  Streit 
[39]  in  both  scope  and  substance.  Streit  demonstrated  the  equivalence  of  the  Baum 
and  Viterbi  state  sequences  with  those  generated  by  the  fixed-interval  smoothing  al¬ 
gorithm  developed  by  Rauch,  Tung,  and  Striebel  (commonly  referred  to  as  the  RTS 
smoother )  [41].  It  is  shown  here  that  Baum’s  forward-backward  algorithm  is  more  nat¬ 
urally  equated  with  the  two-filter  implementation  of  the  smoother  given  by  Mayne  [42] 
and  Eraser  and  Potter  [43],  which  employs  forward-  and  backward-running  Kalman  fil¬ 
ters  and  then  estimates  the  state  sequence  by  optimally  combining  the  estimates  from 
the  two  filters.  The  Viterbi  algorithm  does  lead  naturally  to  the  state-estimation  por- 
.  tion  of  the  RTS  smoother,  but,  as  noted  by  Streit  [39],  it  does  not  give  the  state  error 
covariances.  The  covariance  matrices  from  the  RTS  algorithm  are  obtained  here  using 
a  new  joint-density  marginalization  approach,  which  provides  a  more  natural  derivation 
of  the  RTS  algorithm  from  the  HMM  point  of  view. 

This  report  presents  a  new  Gaussian  refactorization  lemma.  While  the  original 
motivation  for  this  lemma  was  to  characterize  a  recurrent  set  of  operations  when  deriving 
the  HGMM  Baum  and  Viterbi  algorithms,  it  turns  out  that  this  lemma  provides  a 
natural  alternative  derivation  of  the  Kalman  filter  recursions. 

The  conditional  joint  density  of  time-adjacent  states  in  HGMMs  is  also  derived. 
In  addition  to  facilitating  the  theoretical  development  of  the  parameter  estimation  algo¬ 
rithm,  this  derivation  leads  to  a  new  expression  for  the  cross  covariance  between  states 
that  is  considerably  simpler  than  the  recursive  definition  given  by  Shumway  and  Staffer 
[31].  The  simplicity  of  the  new  expression  occurs  because  the  cross-covariance  matrix 
is  viewed  within  the  larger  context  of  the  joint  density,  as  opposed  to  its  being  derived 
by  taking  the  expectation  of  the  appropriate  outer  product. 

In  addressing  likelihood  evaluation  and  parameter  estimation  for  HGMMs,  this 
report  shows  two  additional  equivalences  between  HMMs  and  Kalman-filter  models. 
First,  Baum’s  forward  recursion  for  marginalizing  out  the  states  from  the  joint  distribu¬ 
tion  of  the  states  and  measurements  yields  the  classical  expression  for  the  measurement 
likelihood  from  Kalman  filter  theory  [44].  Second,  the  continuous-state  formulation  of 
the  Baum- Welch  auxiliary  function  leads  to  the  existing  EM  algorithm  for  estimating 
the  parameters  in  Kalman-filter  models. 
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In  addition  to  these  results,  the  CS-HMM  and  HGMM  algorithms  are  extended 
to  accommodate  prior  state  densities  that  axe  composed  of  mixtures.  These  new  devel¬ 
opments  provide  substantial  extensions  of  previous  work  with  mixture-based  Kalman 
filters  [37,38],  first  by  generalizing  the  mixture-based  algorithms  to  the  larger 'class  of 
CS-HMMs,  and  then  by  addressing  the  smoothing  and  parameter  estimation  problems. 
In  the  classification  context,  the  inclusion  of  mixture-based  prior  densities  allows  the 
model  to  accommodate  even  greater  amounts  of  within-class  variability  than  can  be 
handled  by  “single-mode”  models. 

All  of  the  above  results  axe  given  for  models  whose  parameters  axe  time  varying. 
For  HGMMs  with  parameters  that  are  constant  across  time,  the  measurement  likeli¬ 
hood  is  shown  to  be  invariant  to  a  family  of  similarity  transformations  of  the  model 

parameters. 

This  report  provides  a  unified  context  from  which  to  view  HMMs  and  Kalman 
filters  for  both  classification  and  tracking.  The  literature  review  given  above  provides  a 
sampling  of  the  very  rich  history  that  has  unfolded  in  connection  with  these  families  of 
tools.  For  the  most  part,  this  history  has  evolved  along  two  separate  paths,  one  leading 
from  Baum,  the  other  leading  from  Kalman.  It  is  shown  here  that  these  two  paths  are 

really  one. 
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2.  CONTINUOUS-STATE  HIDDEN  MARKOV  MODELS 

In  general,  discrete-time  HMMs  represent  a  sequence  of  observed  M-dimensional 
measurements  Zjv  =  {zi,  z2, . . .  ,zx},  made  at  times  tn,  n  =  1, 2, . . . ,  N,  as  probabilistic 
functions  of  a  sequence  of  unobservable  L-dimensional  states  XN  =  {xo,Xi, . . .  ,Xn}, 
where  Xo  is  governed  by  a  prior  distribution  and  does  not  correspond  to  a  measurement. 
If  the  Xn  are  constrained  to  take  values  from  a  discrete  finite-sized  alphabet,  then  the 
model  is  a  DS-HMM.  In  such  models,  there  is  no  need  to  specify  the  actual  element 
values  for  vector  x^  Only  the  index  into  the  alphabet  containing  the  different  values 
of  Xn  need  be  provided.  This  index  is  usually  denoted  as  i  or  j,  and  the  event  qn(i) 
indicates  that  the  ith  state  has  occurred  at  time  tn.  If  the  elements  of  x„,  are  free  to 
assume  values  on  the  real  line,  then  the  model  is  a  CS-HMM.  In  these  models,  the 
element  values  must  be  specified  since  the  indexing  of  all  possible  state  vectors  would 
require  an  uncountable  number  of  index  values. 

Regardless  of  whether  the  states  are  discrete  or  continuous,  they  are  usually  as¬ 
sumed  to  obey  a  first-order  Markov  process,  where  each  state  depends  only  on  the  state 
that  occurred  at  the  previous  time  instant.  These  first-order  HMMs  are  characterized 
by  three  model  probability  functions.  First,  the  state-transition  distribution  governs 
the  probability  of  moving  from  one  state  to  another  in  a  single  time  step.  Second, 
the  output  distribution  governs  the  probability  of  obtaining  a  particular  measurement, 
given  the  value  of  the  state  vector.  Finally,  the  prior  state  distribution  governs  the 
probability  that  the  state  at  time  to  will  take  on  a  particular  value. 

For  DS-HMMs,  the  prior  state  probabilities  and  state-transition  probabilities  are 
discrete  sets  of  numbers  denoted  by  71*  and  a^,  respectively,  for  states  with  indices  i  and 
j.  The  output  distribution  can  be  either  discrete  or  continuous,  and  is  denoted  bj(zn). 
While  discrete-state  models  work  quite  well  for  certain  classes  of  signals  and  features 
(e.g.,  spectral  or  cepstral  coefficients  of  speech),  they  are  ill-suited  for  feature  sequences 
that  follow  a  continuous  trajectory  through  state  space  (e.g.,  the  instantaneous  fre¬ 
quency  of  a  wandering  tonal).  For  these  types  of  signals,  continuous-state  models  allow 
a  more  accurate  representation.  The  model  distributions  are  denoted  in  this  case  by 
the  density  functions  p(xo|0o),  p(xn|xn_i,  0X),  and  p( z^x*;  6Z).  The  structure  of  these 
model  densities  is  usually  known  (e.g.,  Gaussian,  Rayleigh,  gamma),  but  the  parameter 
sets  0O,  and  Oz  must  be  estimated  for  each  class  from  training  data.  For  conve¬ 
nience  in  notation,  only  those  probability  models  having  well-defined  density  functions 
are  considered  here. 
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The  notation  for  discrete-  and  continuous-state  models  is  summarized  in  table  1. 
It  is  important  to  note  that,  when  moving  from  discrete  to  continuous  models,  expres¬ 
sions  such  as  p(x;  6)  no  longer  designate  probabilities,  but  likelihoods.  While  it  would 
be  technically  more  accurate  to  use  expressions  such'  as  £(x;  6)  for  CS-HMMs,  the  p- 
notation  is  used  to  follow  conventions  established  in  existing  literature.  The  notation 
is  abused  even  further  by  using  the  same  symbol  for  the  likelihood  of  an  event  x  (i.e.,  a 
particular  numerical  value  or  sequence  of  values)  and  the  probability  density  function 
of  a  random  variable  x.  The  context  can  serve  as  a  guide  for  the  meaning  of  the  expres¬ 
sion,  however,  since  the  difference  between  the  likelihood  function  and  the  probability 
density  function  is  merely  a  matter  of  whether  9  or  x  is  treated  as  unknown,  and  the 
likelihood  is  just  the  likelihood  function  evaluated  at  a  particular  value. 


Tabic  1.  Notation  for  Hidden  Markov  Models 


Discrete- State 
Model 

Continuous-  State 
Model 

State 

hi 

Xn 

Measurement 

Zn 

Zn 

State-Transition 

Probability 

p(Xn\Xn-l,6X) 

Output 

Probability 

*(*») 

P(  Zn\Xn,ez) 

Prior  State 
Probability 

7T* 

p(xoW0) 

xnii  section  derives  algorithms  for  likelihood  evaluation,  state  estimation,  and 
model  parameter  estimation  with  CS-HMMs.  For  discrete-state  models,  these  problems 
are  solved  using  the  Baum,  Viterbi,  and  Baum- Welch  algorithms  [9,13,17,18].  The 
continuous-state  versions  of  these  algorithms  are  derived  here.  While  the  continuous- 
state  analog  to  the  Baum- Welch  algorithm  cannot  be  completely  specified  without  as¬ 
suming  explicit  forms  for  the  model  densities,  the  auxiliary  function  for  the  EM  algo¬ 
rithm  can  be  formulated  in  a  density-independent  fashion.  In  addition  to  these  devel¬ 
opments,  the  CS-HMM  and  associated  algorithms  are  extended  to  include  prior  state 
distributions  ,  consisting  of  a  mixture  of  densities. 
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2.1  BAUM  PROBABILITY  DENSITIES 

The  state  evolution  in  CS-HMMs  is  characterized  by  the  joint  density  of  the 
measurement  and  state  sequences,  which  is  given  by 

■  N 

p(ZN,XN)  =  p(xo)  n  K^nWlKXiilXn-l)*  (1) 

n=l 

where  the  explicit  notational  dependence  on  the  model  parameters  has  been  dropped. 
This  expression  makes  two  basic  assumptions:  (1)  the  states  follow  a  first-order  Markov 
model  and  (2)  the  measurements  axe  conditionally  independent,  given  the  states.  Class 
assignments  axe  made  in  signal  classification  using  the  measurement  likelihood,  which 
is  obtained  by  marginalizing  equation  (1)  over  all  possible  state  sequences,  giving 

pC^n)  =  y>dXjvp(Z^,Xjv).  (2) 

Here,  the  shorthand  /  dX/v  denotes  the  multiple  integral  /  dxo  •  •  •  /  dx^,  where  each 
single  integral  /  dx^  is  an  L-dimensional  integration  over  state  space.  For  the  discus¬ 
sion  below,  it  is  also  necessary  to  introduce  the  partial  measurement  sequence  Zn  = 
{zi,  Z2, . . . ,  z„}  and  its  complement  in  ZN,  denoted  Z£  =  {zn+i,  zn+2, . . . ,  zN }. 

Became  of  the  intractable  computational  load  required  to  evaluate  the  discrete 
equivalent  of  equation  (2),  Baum  et  al.  [9]  developed  recursive  functions  to  characterize 
and  marginalize  the  joint  probability  for  DS-HMMs.  These  functions  are  defined  in 
table  2  using  both  the  discrete-state  and  the  continuous-state  notation.  The  algorithm 
for  computing  these  probability  (or  probability-density)  functions  is  referred  to  as  the 
Baum,  or  forward-backward,  algorithm.  The  Baum  recursions  for  DS-HMMs  are  given 
in.  rable  3  for  comparison  with  the  continuous-state  recursions  derived  in  this  section. 

2.1.1  Forward  Densities 
The  forward  densities  are  defined  as 

a(xn)  =  p(Zn,Xn) 

=  p(zn|Zn_1,xn)p(Zn_1,xn) 

=  p(2R|Xn)  J  d.Xn-1  j?(Zn_i,  Xn,  Xn—i) 

=  p(zn|Xn)  y'dxn_ip(xn|Zn_1,Xft_1)p(Zn_1,Xn_1) 

=  p{zn\Xn)  J  dXn-i  p^Xn^)  a(x„_i),  (3) 

where  the  recursion  defined  in  this  last  expression  is  initialized  as  a(x o)  =  p(xo). 
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Table  2.  Baum  Probability  Functions 


Discrete-State 

Definition 

Continuous-State 

Definition 

Forward 

Probability 

MO  =P{9n(i),2»> 

<*(x»)  —  p(xn,Z„) 

Backward 

Probability 

/3„(i)  =p{z£|  <?„(*)} 

/3(xn)  =p{Z%\xn) 

Conditional  State 
Probability 

MO  =P{9n(i)|Zw} 

y(xn)  =  p(Xn\Zff) 

Conditional  Joint 
State  Probability 

MM)  =p{?n-l(i),9nO')|Zjv} 

7(x„,xn_i)  =p(x„,x„_i|Zjv) 

This  recursion  can  be  alternatively  defined  by 

«(*»»)  =  P(Zn|Xn)  J  dXn _x  ^(Xn.Xn-i),  (4) 

where  ^(xn,xn_i)  is  defined  as 

^(XnjXn— l)  =  p{  Zn— i,Xn,Xn_i) 

=  KXn|Zn-l,Xn_1)p(Zn_1,Xn_i) 

=  j?(xn|xn_1)a(xn_1).  (5) 

The  recursion  for  a(xn)  can  therefore  be  performed  by  first  computing  and 
£(xn,Xn-i)  and  then  multiplying  by  the  output  density. 

2.1.2  Backward  Densities 

The  backward  densities  are  needed  primarily  as  an  intermediate  step  in  calculat¬ 
ing  the  conditional  densities  7(xn)  s&d  7(xn+i:  x«)-  The  recursion  for  the  backward 
densities  is  given  by 

P(Xn-l)  =  p(Z^_1|xn_i) 

p(xn  ^  J  jP(Zjj  |Xn,  Xjj_x)  ^(Xn,  X,j_x) 

=  y>^Xnl3(zn|Xn)p(Z^|xn)p(xn|xn_i) 

=  y*dxnp(zn|xn)p(xn|xn_1)i3(xn).  (6) 
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Table  3.  Baum  Algorithm  for  DS-HMMs 


Forward  Probability 
(Initialization) 

Oo(«)  =  Xt  for  all  i 

Forward  Probability 
(Recursion) 

&n(j)  —  bj(Zn)  &ij 

i 

Measurement 

Probability 

p(Zn)  =  ]jP  art(i) 
i 

Backward  Probability 
(Initialization) 

0N(j)  =  1  for  all  j 

Backward  Probability 
(Recursion) 

! 

A»(*)  =  Oti^(Zn+l)/3n+x0') 

3 

Conditional  State 
Probability 

7nW  “  p(ZN) 

Conditional  Joint 
State  Probability 

7 n(hj)  —  p(Z/v*)  bj  fen)  &n{j) 

In  this  expression,  /3(xn)  is  undefined  at  the  terminal  time  tw  because  is  empty. 
The  DS-HMM  literature  usually  defines  /5jv(i)  =  1  for  all  j.  In  the  continuous-state 
case,  such  a  definition  is  problematic  because  0(xn)  =  1  is  not  an  integrable  proba¬ 
bility  density.  To  be  formally  correct,  the  recursion  should  be  started  at  time  tjv-i, 
and  /?(x./v-i)  =  /  dx/v  p(z^  |x//)  p(xn  |x^_i)  should  be  defined  as  a  special  case.  This 
approach  is  notationally  inconvenient,  however,  because  other  densities  are  defined  as 
products  of  a  and  /?,  which  would  cause  a  proliferation  of  special  cases.  So,  as  a  purely 
notational  mechanism,  /3(xat)  is  set  to  unity  for  all  xjv*. 

The  backward  density  can  be  alternatively  defined  as 

0(*n-i)  =  J  dxnp(xn\xn-i)  ^(Xr),  (7) 

where 

V’(Xn)  =  p(Z^_1|xn) 

=  p(zn|Z^,xn)p(Z^|xn) 

=  p(zn|Xn)/?(Xn).  .  (8) 

The  backward  recursion  thus  proceeds  by  first  computing  ^(Xn),  then  multiplying  by 
the  transition  density,  and  finally  marginalizing  over  x^. 
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2.1.3  Conditional  State  Densities 


The  conditional  state  probability  densities  characterize  the  stochastic  properties 
of  individual  states  when  conditioned  on  the  observed  measurements.  These  densities 
can  be  maximized  to  determine  the  sequence  of  individually  most  likely  states.  They 
axe  also  important  for  parameter  estimation.  These  densities  are  defined  as 

7(Xn).  =  P(Xn|Z^) 

"  ^p(z"’z”’x“) 

=  ^|^yp(Zn|Zn,xn)p(Zn,xn) 

=  ^TKZn|Xn)P(Zn.Xn) 

=  (9) 

2.1.4  Conditioned  Joint  State  Densities 

Finally,  the  conditional  joint  state  densities  characterize  the  relational  properties 
of  time-adjacent  states  when  conditioned  on  the  measurements.  They  are  defined  as 

7(xn,Xn_i)  =  jKx^Xn-ilZtf) 

1  0 

=  ^^y^Zn-lrZn-^Xn^n-i). 

=  Zt1|Zn^1,XB,XB_i)p(ZB.ilS»,Xit-i) 

=  ^^jP(Zj_1|xn)p(Zw_i,xn,xn_i) 

=  ^|^(xn)<5(xn,X„_i).  (10) 

This  expression  looks  quite  different  from  the  discrete-state  version  because  it  has  been 
derived  directly  in  terms  of  i/>(x„)  and  6(xn,  x„_i).  Substituting  the  definitions  of  ip(xn) 
and  ^(xnjXn-i)  in  terms  of  /3(xr)  and  o^Xr-i)  provides  the  more  familiar-looking  ex¬ 
pression 


7(xn,xn_i) 


^^yp(zn|xn)/3(xn)p(xn|xn_1)  o(x„_i). 


(11) 


2.2  LIKELIHOOD  EVALUATION 

The  measurement  likelihood  can  be  expressed  in  terms  of  the  forward  and  back¬ 
ward  densities  by  writing 


p(ZN)  =  J  dXn  p(Zff,  xn) 

=  J  dXnp(Z^|Zn,  Xn)p(Zn,  Xn) 
=  J  dxnp(Z^\xn)p(Zn,xn) 

=  y*dxn/?(xn)a(xn). 


(12) 


This  result  demonstrates  that  the  definition  given  for  the  conditional  state  density  is 
properly  normalized;  that  is,  /  dxn7(x„)  =  1  for  all  n.  A  simpler  expression  is  obtained 
by  evaluating  the  likelihood  at  tp?,  where  it  is 


i 

p(ZN)  =  JdxNp( ZN,xN)  -  J  dxNa(xN). 


(13) 


This  outcome  is  consistent  with  equation  (12)  since  /3(x.w)  =  1.  The  Baum  algorithm 
and  likelihood  evaluation  formulas  for  CS-HMMs  are  collected  in  table  4. 

2.3  VTTERBI  ALGORITHM 

The  Viterbi  algorithm  is  a  two-pass  dynamic  programming  algorithm  [45]  that 
evaluates  the  maximum  a  posteriori  (MAP)  estimate  of  the  state  sequence,  defined  as 


XN  -  argmaxp(Xjv|Ztf)  =  argmax  p(XN,  ZN), 

Xjv  Xjy 


(14) 


where  p{ Zn)  is  constant  for  any  Zn  and  does  not  affect  the  argmax  operation.  The 
forward  pass  propagates  a  function  <£(x„),  which  is  initialized  as  0(xo)  =  p(xo).  The 
forward  recursion  is  then  defined  for  n  =  1, . . . ,  IV  as 


=  p(Znjx„)  maxjpfXnlXn-i) 


(15) 


This  expression  is  similar  to  Baum’s  forward  density  ct(xn),  except  that  it  contains  a 
maximization  instead  of  a  marginalization. 

The  backward  pass  of  the  Viterbi  algorithm  is  a  back  substitution  operation.  The 
optimal  estimate  at  time  is  determined  as 


xn  =  argmax$(xjv), 


which  is  then  used  to  initialize  the  backward  recursions,  defined  as 

Xn-i  =  aign^x{p(x.,|xn.i)0(x„-1)}. 


(16) 


(17) 
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Table  4-  Baum  Algorithm  for  CS-HMMs 


Forward  Density 
(Initialization) 

a(xo)  =  p(x o) 

Forward  Density 
(Stage  1  Recursion) 

Forward  Density 
(Stage  2  Recursion) 

a(xn)  =  p(znjXn)  J dXn-l  <S(x»,Xn-i) 

Measurement 

Likelihood 

P(Zat)  =  J dx.n  a(xN) 

Backward  Density 
(Initialization) 

/?(xn)  =  1  for  all  xjvr 

Backward  Density 
(Stage  1  Recursion) 

T/>(xn)  -  p(zn\Xn)  0(xn) 

Backward  Density 
(Stage  2  Recursion) 

/3(x„_i)  =  y<ixnp(xn|xn_i)^(xll) 

Conditional  State 
Density 

y(Xn)  =  tfk) *(3S")*(X") 

Conditional  Joint 
State  Density 

Xn— l)  —  p(Ztf)  <^(Xn)^(Xn,Xn“1) 

2.4  PARAMETER  ESTIMATION 

Hidden-state  models  are  natural  candidates  for  the  EM  algorithm,  which  distin¬ 
guishes  three  types  of  data:  the  incomplete  data,  which  in  this  context  are  the  observed 
measurements;  the  missing  data,  which  are  the  hidden  states;  and  the  complete  data, 
which  are  the  concatenation  of  the  incomplete  and  missing  data.  Since  the  joint  density 
in  equation  (1)  is  the  likelihood  of  the  complete  data,  that  function  is  referred  to  here 
as  the  complete-data  likelihood  function  (CDLF). 

For  time-varying  models,  parameter  estimation  using  the  EM  algorithm  requires 
the  use  of  a  multiple-sequence  training  set  since  no  time  averaging  can  be  performed  and 
since  EM  parameter  estimation  with  a  single  piece  of  data  merely  repeats  the  previous 
estimate  at  each  iteration.  Single-sequence  training  can  be  performed  for  time-invariant  ■ 
models  since  the  parameters  can  be  averaged  over  time,  although  classification  models 
so  obtained  will  likely  have  poor  generalization  performance. 
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The  multiple-measurement  training  set  is  denoted  as 

2  =  (18) 

where  Z%k  =  |z),  z,, . . .  is  the  fcth  training  sequence.  The  lengths  of  these  training 
sequences  are  not  constrained  to  be  equal,  although  the  sequences  are  assumed  to  be 
arranged  so  that  Ni  >  iV2  >  •  •  •  Nk.  Even  with  multiple  training  sequences,  however, 
a  difficulty  arises  due  to  the  unequal  lengths  of  the  training  sequences.  Parameters 
in  the  densities  ^(xnlxn-i,  0)  and  p(zn\xn,  0)  corresponding  to  large  n  are  estimated 
from  fewer  and  fewer  training  sequences  as  n  successively  exceeds  the  lengths  of  the 
shorter  measurements.  If  there  is  a  unique  longest  training  sequence,  then  parameters 
corresponding  to  time  samples  that  occur  after  the  end  of  the  second  longest  sequence  are 
“estimated”  from  a  single  measurement.  This  problem  might  be  addressed  by  truncating 
the  longer  measurement  sequences  to  some  predetermined  value  or  by  using  some  type 
of  time- warping  to  obtain  equal-length  measurements,  depending  on  the  application.  In 
what  follows,  the  above  difficultly  is  assumed  to  have  been  dealt  with  in  the  appropriate 
manner. 

Notationally,  the  unequal  sequence  lengths  are  accommodated  by  introducing  the 
variable  Kn,  which  is,  for  each  time  tn,  the  number  of  training  sequences  whose  length 
equals  or  exceeds  n.  This  variable  represents  the  effective  number  of  training  sam¬ 
ples  available  at  tn.  Recalling  that  the  measurements  are  assumed  to  be  arranged  in 
decreasing  order  of  length  and  defining  Nmax  =  ma x(Nk),  then  for  any  function  /(x*), 

K  Kk  -Nmax  Kn 

EE/G4)  =  E  £/(*£)•  (19) 

fc=l  71=1  71=1  fc=l 

The  EM  algorithm  generates  parameter  estimates  iteratively,  where  at  each  iter¬ 
ation  the  estimates  are  chosen  to  maximize  the  conditional  expectation  of  the  CDLF, 
given  the  observed  data  and  the  model  parameters  from  the  previous  algorithm  iteration. 
Denoting  by  X  =  k  =  1, . . . ,  the  collection  of  state  sequences  corresponding 

to  the  measurements  in  the  training  set,  the  CDLF  for  the  training  set  is 

V(Z,X  |S)  =  n?(zk.xkl©) 

b=  1 

K  Nk 

=  IW^o)  IIp(x^|34-i^x)p(4|4>^)-  (20) 

fc= 1  71=1 
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Each  iteration  of  the  EM  algorithm  generates  estimates 

©*+1  =  argrrmx<2(©,  ©*)  (21) 

that  maximize  the  auxiliary  function 

Q(&,  ©*)  =  Ex\z&'{\ogp(Z,X\®)} 

=  j  dXp(X\Z,&)  log  p(Z,  X\@) 

=  II  J  dXitt  P(xwJZ^,  ©*)  log  p(Z,  X\@).  (22) 

£=1J 

The  E-step  evaluates  the  auxiliary  function  at  ©*  from  the  previous  iteration,  yielding  a 
function  of  the  unknown  parameters  only.  The  M-step  then  generates  optimal  estimates 
of  the  unknown  parameters  by  maximizing  the  auxiliary  function.  Since  the  M-step 
requires  differentiation  with  respect  to  the  model  parameters,  it  cannot  be  specified 
without  imposing  a  particular  form  for  the  model  densities.  The  E-step  can  be  specified 
in  greater  detail,  however,  by  imposing  the  structure  of  the  HMM. 

Since  the  CDLF  consists  of  a  product  of  model  densities  that  are  parameterized 
by  separate  subsets  of  the  model  parameters,  the  auxiliary  function  is  decomposed  as 


Q(®,  ©*)  —  Qo(6o,  ©*)  +  Qx(0x,  ©*)  +  Qz{0z,  ©*),  (23) 

where  each  component  corresponds  to  one  of  the  three  model  densities.  The  parameters 
in  the  model’s  initial-state  density  are  estimated  by  maximizing  the  component 

<?o(0o,©r  =  Ex\z,&  jlog  [np(xo^o)J} 

=  II  /  ^k  p(xk  .!Zk » 0l)  {  lo§  p(x£|0o)  | 

K  K 

-  £  II  /  pcxiiiz;,..®*)  iogp(4i«0) 

fc=x  e=i J 

=  £  p&k |Zt,.  ©’)  log 

*=1  J 

=  £  /*S  p(4iz«„  ©■)  log  p(xJ|«o) 

■  br 

=  s/rfx0  7(*o)  logp(Xfl|^o),  '  (24) 

where  y(x§)  =  p(x% jZfa,  ©*)  is  the  conditional  state  density  for  time  to  under  the  old 
parameter  values  in  ©*. 
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The  parameters  in  the  model’s  transition  density  are  obtained  during  each  EM 
iteration  by  maximizing  the  auxiliary-function  component 

'  K  Nk 

log  UIIp«\<-M 

_  fc=i  n=l 

K  r  (KNk  ) 

=  II  / rfx^Ptx^iz^.e*)  Y,  Y  iogp(xJ|4.„«x) 

t=iJ  [mm  J 

X  .  I  Nt  1 

=  Ej^k  p&k iz^.e-)  <  E  logptxiixS.,,^) 

k=l  J  { n=l  J 

KNk 

=  EE  J^kp^klzk’e')l°sp(<\4-M 

k= 1  n=l  J 

K  Nk  .. 

=  EE/fe  *4-i  log  p(x^|x^_2, 6x) 

fc= 1  71=1  ^  J 
■^max  ft 

=  E  E  /*4  /*4-i  7(x^,34^1)iogp(4i4_1,^),  (25) 

71=1  fc=l  ^  J 

where  7(x*,  x*-i)  =  ©*)  is  the  conditional  joint  state  density  obtained 

from  Baum’s  forward-backward  algorithm  with  the  Arth  measurement.  Note  that  one 
of  the  summations  over  the  members  of  the  measurement  training  set  drops  out  in  the 
third  step  of  equation  (25)  because  the  summand  is  zero  except  when  l  —  k. 

Finally,  the  parameters  in  the  output  density  axe  obtained  by  maximizing  the 
auxiliary-function  component 

{’  K  Nk 

log  EUp«\<M 

_k=l  n=l 

K  f  (KNk  1 

=  n  /  JX-Nt  P(X^|Z^,©*)  i  Y1  l°9P{zkn\x^,0z)  > 

^=1  (  fc=l  n=l  J 

K  .  (  Nk  } 

=  EJ^k  p(xj,jz5,„,e*)  E  iogP(zS|xj,«2) 

fe=l  ^  n= 1  J 

Nmax  Kn  r 

=  EE/^  7(xJ)  logp(z*|34,^),  (26) 

n=l  fc=l  J 

where  7(^4)  =  p(x£|Z^,©*)  is  the  conditional  state  density  obtained  from  Baum’s 
algorithm.  The  above  results  axe  summarized  in  table  5. 


Qx(0x,®1)  =  Ex\z&' 
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Table  5.  EM  Auxiliary-Function  Components  for  CS-HMMs 


Transition 

Density 

Parameters 

<?x(*x,e*)= 

n=l  k=l  J  J 

7(xS,4_1)  logp(4l4_1,«x) 

Output 

Density 

Parameters 

Kn 

«*(«*,  e')  =  £  £ 

ft=  1  1  ^ 

f  <&£  7(X^)  log  p(Zn\-)£,8z) 

Prior  State 
Density 
Parameters 

Qo(0o,e')  =  E  /<&o  7(xo)  log  p(xol^o) 

*= i  J 

If  the  continuous  variables  in  this  table  are  replaced  with  their  discrete  counter¬ 
parts  and  the  auxiliary-function  components  are  maximized  over  those  variables,  the 
Baum- Welch  re-estimation  formulas  [9]  are  obtained.  Estimation  of  the'  parameters  in 
the  transition  density  is  simplified  in  this  case  because  the  Oij  enter  into  the  model 
linearly,  subject  to  the  constraint  that  the  “exiting  probabilities”  for  the  zth  state  must 
sum  to  unity.  The  re-estimation  formula  for  Oy  is  therefore  obtained  by  solving  the 
constrained  optimization  problem  whose  Lagrangian  is 

Qx  =  E  E  EI2  7n(*J)  log  fly  +  A  1 1  -  Y,  Oij 

n=l  k= 1  ij  \  j 

A  similar  Lagrangian  is  used  to  obtain  the  output  probabilities  for  discrete  measurement 
spaces. 

2.5  MIXED-MODE  MODELS 

This  section  extends  the  CS-HMM  by  letting  the  initial  state  be  governed  by  the 
mixture  of  densities 

j 

p(x o)  =  ]>>iP(xo|j),  (28) 

i=i 

where  p(xq|j)  is  the  j th  mode  in  the  mixture,  j  is  the  mode-assignment  index,  and 
Pj  —■  P{j)  is  the  mode-assignment  probability,  or  mixing  parameter.  Since  j  is  a  discrete 
variable,  pj  is  a  probability  measure  and  not  a  density.  As  usual,  the  mixture  is  assumed 
to  be  a  convex  combination  with  pj  >  0  for  all  j  and  £/=i  pi  =  1.  For  reasons  that 
will  later  become  clear,  the  resulting  model  is  referred  to  as  a  “mixed-mode”  CS-HMM. 
Due  to  the  commutativity  of  the  summation  and  integration  operations,  the  Baum 
functions  for  mixed-mode  models  all  take  the  form  of  J-component  mixtures,  so  that  a 
mixed-mode  model  acts  like  a  “bank”  of  single-mode  CS-HMMs. 
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2.5.1  Baum  Probability  Densities  and  Likelihood  Evaluation 
The  forward  density  is  given  by 

a(xn)  =  p(  Zn,Xn) 
j 

=  ^2p(Zn,XnJ) 

3=1 

J 

=  Y,P(Zn^\j)p(j) 

3=1 

J 

3=1 

where  r 


(29) 


®jfan)  -r  P(%niXn\j)  (30) 

is  the  forward  density  obtained  using  the  Baum  recursion  with  the  single-mode  prior 
P(xoli).  Tlie  backward  density  /?(xn)  is  the  same  for  all  j  and  is  identical  to  the 
single-mode  case.  Integration  of  the  terminal  forward  density  over  state  space  gives  the 
measurement  likelihood  as  a  mixture  of  single-mode  measurement  likelihoods.  Defining 

p(ZN\j)  =  JdxNaj(xN )  (31) 

results  in  the  likelihood  being  written  as 


p(zn)  =  ^2PjP(zN\j). 

3=1 


The  conditional  mode-assignment  probability 


Pj\N  =  PUIZn)  =  j  P(ZN\j)  Pj 


(32) 


(33) 


is  required  along  with  the  conditional  state  densities  7(x„)  and  7(xn,xn_1)  to  fully 
characteri2e  the  expected  state  evolution.  Given  Pj\N>  the  conditional  state  densities 


are 


7(*n)  =  p(Xn  \ZN) 

J 

=  Sp(Xn,;'|Z^) 
3=1 
J 

3=1 

J 

=  £  Pj\N  7j(Xn)> 
3=1 


(34) 
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where 


7i(Xn)  =  P(xn|Zjv,j) 

=  j^jOjW^W  (35) 

is  the  conditional  state  density  for  a  single-mode  model  with  prior  p(xo|;)-  While 
maximization  of  equation  (34)  to  obtain  the  optimal  state  sequence  is  a  difficult  non¬ 
linear  optimization  problem  even  for  simple  model  densities,  likelihood  evaluation  and 
parameter  estimation  (the  two  crucial  problems  for  classification  applications)  can  be 
performed  directly  from  the  densities.  Finally,  the  joint  state  densities  are 


where 


7(xn,xn_i)  =  p(xn,xn_1  \ZN) 

J 

=  ^p(Xn,X„_  i,j\ZN)  . 

j= 1 
J 

=  SP^Xn-ll^Z^pO'IZiv) 
j=l 
J 

Pi\N  'Yjfeni  ^n-l)> 
i= 1 


(36) 


TjC^Xn-l) 


=  p(Xn,Xn-i\j,  Zn) 


The  expressions  that  are  unique  to  mixed-mode  models  are  summarized  in  table  6. 


2.5.2  Parameter  Estimation 

For  an  observed  measurement  sequence,  knowledge  of  the  mode  assignment  vari¬ 
able  j  would  reduce  the  mixed-mode  modeling  problem  to  a  single-mode  problem.  The 
natural  choice  for  the  missing  data  in  mixed-mode  models  therefore  includes  the  mode 
assignment,  in  addition  to  the  state  sequence.  The  resulting  CDLF  is 

p{Z,X,J)  =  f[p(Zk„t,XkNt,jk) 

fc=l 

K  Nk 

=  IIPjfcP(4b'fc)IIp(34|x^-l)P(Zn|34)> 

fc=l  n—1 

where  jk  is  the  mode  assignment  for  the  fcth  measurement  and  J  represents  the  collec¬ 
tion  of  mode  assignments  for  all  measurements. 
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Table  6.  Baum  Algorithm  for  Mixed-Mode  CS-HMMs 


Forward  Density 
(Stage  1  Recursions) 

=  p(xn|x„_i)cy(xn_i) 

Forward  Density 
(Stage  2  Recursion) 

0,(x„)  =  p(z„|Xn)  J dXn-1  5j(xn,Xn- 1) 

Single-Mode 
Measurement  Likelihood 

p(ZN  \j)  =  J  dXNOij(Xtt) 

Measurement 

Likelihood 

J 

p(ZN)  =  PiP(ZN\j) 

j-1 

Conditional  Mode- Assignment 
Probability 

=  p(ZN)pjP{ZNb) 

Conditional  State 

Density 

7(x„)  =  ^p,|jva,(x„)/3(x„) 

Conditional  Joint 

State  Density 

1  J 

7(x„,Xn-l)  =  TTjjj-T  ^Pi|N^(Xn,X„_i)^(xn) 

'  j-1 

The  EM  auxiliary  function  in  the  mixed-mode  case  is 


0(0,0’)  =  Ex^z,»{^gf(Z,X,J)} 

-  52jdXp(X,J\Z,&)  log  p(2,X,J) 

=  nf/  jx-n.p&n.’H  log  p(z,x,j). 


(38) 


<=1  h 


As  before,  the  auxiliary  function  can  be  decomposed  into  components  that  depend 
exclusively  on  the  different  subsets  of  model  parameters;  that  is, 

<?(©,  ©’)  =  Qj(p,  ©l)  +  <2o(0o,  ©*)  +  Qx(Bx,  ©*)  +  Qz(0z,  ©*).  (39) 


Components  Qx  and  Qz  are  identical  in  form  to  the  single-mode  counterparts  because 
the  relevant  components  from  the  CDLF  (i.e.,  the  product  of  transition  densities  for  Qx 
and  of  output  densities  for  Qz)  are  independent  of  the  mode  assignment.  Summation 
over  je  thus  serves  to  marginalize  the  mode  assignments  from  the  conditional  density 
p(Xjy  ,#|Zjy  ,  ©*).  The  mixed-mode  nature  of  the  model  shows  up  in  the  parameter 
estimates  via  the  conditional  state  densities. 
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The  auxiliary-function  component  Q0  for  the  prior  state  density  is 


-  <2o(0o>©*)  =  EXjr\z&  |  log  |nP(*oli*)  | 

=  EE  f  dxo7 (*o)  logp(4li)- 

J=1  k=l  J 

Component  Qj,  which  is  used  to  optimize  the  mixing  parameters,  is  given  by 
Qj(p,&)  =  Ex,j\z,e*  jlog  |  JJ  pjk  | 


K  J 


=  iie/ iiog 


e=lje=l 
K  J 


{ios  n* } 


=  E  E/ dX.kNkp(XkNkJk \ZkNk,&)  log  pjk 

k=l  ifc=l  J 


=  EE  pUk\zkNk,&)\ogPjk 

k=i  jk= l 
J  K 

S  PiaMv*  Pjk  • 

i=l  Ar=l 


(40) 


(41) 


Since  the  model  is  linear  in  the  mixing  parameters,  the  EM  update  for  these  parameters 
can  be  expressed  without  knowing  the  form  of  the  other  model  densities.  The  updates 
are  obtained  by  maximizing  Qj,  subject  to  the  constraint  that  the  ps  sum  to  one.  The 
Lagrangian  is 


3  K  /  J  \ 

Qj  =  E  E  Pik\Nk  log  P4  +  A  U  “  ,  (42) 

*=ifc=i  \  e=i  J 

where  i  is  a  dummy  version  of  the  mode  assignment.  Differentiating  with  respect  to  pj 
and  equating  the  resulting  derivative  to  zero  yields 

l  k 

PT1  =  tE  Pik\Nk-  (43) 

A  fc=i 

Imposing  the  constraint  E/=1  Pj  =  1  gives  the  Lagrange  multiplier  as  A  =  K.  The 
parameter  update  is  therefore 

1  k 

P f 1  =  1?  E  Ph\Nk-  (44) 

^  k=i 
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3.  HIDDEN  GAUSS-MARKOV  MODELS 
Hidden  Gauss-Markov  models  (HGMMs)  result  when  the  model  densities  in  the 
CS-HMM  take  the  form 

p(Xn|Xn_i,0x)  =  ^(XnjAnXn-^Qn)  (45) 

p(zn|Xn,  $z)  =  N{zn\  BnXn,  R„)  (48) 

p(xo|0o)  =  A/'Cxoj/iojPo) ,  (47) 

where  the  usual  shorthand  A/”(y;  ju,  P)  is  used  to  denote  the  density  function  for  a 

multivariate  normal  vector  y  with  mean  and  covariance  matrix  P.  For  example,  if  y 
is  L-dimensional,  then  the  density  function  is 

Jsf{ y;  H, P)  =  (27r)-L/2  |P|_1/2  exp  |-^(y  -  m)T  P"1  (y  “  A6)}  •  (48) 

The  model  densities  in  an  HGMM  are  parameterized  by  the  sets  Oo  =  {mo>Po}>  = 
{An, Qn})  and  Bz  =  {Bn, R^},  where  n  =  The  transition  matrices  {An}, 

output  matrices  {Bn},  and  covariance  matrices  {Qn}  and  {Rn}  are  collectively  referred 
to  as  the  system  matrices.  Although  time-varying  models  are  considered  throughout 
most  of  this  section  for  the  sake  of  generality,  time-invariant  models  (whose  system 
matrices  are  the  same  for  all  n)  are  discussed  briefly  in  section  3.5,  where  a  parameter- 
invariance  structure  is  noted  for  the  measurement  likelihood  function. 

The  model  densities  that  characterize  HGMMs  provide  alternative  expressions  for 
the  set  of  equations  defined  by 


Xn 

=  An  Xn_l  +  Wn 

W 

Zn 

=  B„Xn  +  Vn 

(50) 

P(w„) 

=  A'(wn;  0,  Qn) 

(51) 

P(Vn) 

=  A/Xvn;  0,Rn) 

52} 

P(x o) 

=  A/"(xo;  Ho,  Po) , 

(53) 

which  is  recognized  as  the  defining  model  for  the  Kalman  filter.  While  Kalman  filters 
are  not  typically  viewed  in  the  context  of  HMMs,  they  have  recently  been  described 
as  being  “analogous”  to  CS-HMMs  [25,27,35].  This  section  demonstrates  that  the 
relationship  is  not  merely  an  analogy,  but  that  Kalman-filter  models  in  fact  form  a 
subset  of  CS-HMMs.  The  CS-HMM  results  given  in  the  previous  section  are  specialized 
to  the  Gaussian  models  in  equations  (45),  (46),  and  (47)  to  show  that  Baum  s  forward- 
backward  algorithm  and  the  Viterbi  algorithm  are  implemented  by  the  two-filter  [42, 43] 
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and  RTS  [41]  formulations  of  the  fixed-interval  Kalman  smoother,  respectively.  The 
measurement  likelihood  obtained  from  the  forward  pass  of  the  Baum  algorithm  is  shown 
to  equal  the  innovation-based  definition  from  Kalman-filter  theory  [44],  and  an  existing 
EM  parameter  estimation  algorithm  [31, 32]  is  shown  to  follow  directly  from  the  CS- 
HMM  auxiliary  function. 

Also,  paralleling  the  developments  of  the  previous  section,  mixed-mode  HGMMs 
are  defined  in  which  the  single-component  prior  density  in  equation  (47)  is  replaced  by 
the  7-component  mixture 

j 

p(xo\90,p)  =  ^(x0;/4,Po)  ,  (54) 

where  the  mixing  parameters  satisfy  pj  >  0  for  all  j  and.^/siPj  =  1-  Here,  90  = 
|/4)Po>  3  =  1j  ••  •  j  contains  the  parameters  for  each  mode  in  the  mixture.  The 
mixed-mode  HGMM  is  developed  to  provide  more  flexible  and  accurate  models  for 
short  measurement  sequences  whose  assessed  likelihoods  are  very  sensitive  to  the  prior 
distribution,  and  to  better  represent  classes  of  signals  whose  members  are  well  modeled 
by  the  same  set  of  system  matrices,  but  which  exhibit  significant  within-class  variability 
due  to  different  initial-state  values. 

3.1  GAUSSIAN  REFACTORIZATION  LEMMA 

Derivation  of  the  HGMM  algorithms  is  simplified  by  introducing  the  following 
Gaussian  refactorization  lemma  (GRL),  whose  proof  is  given  in  appendix  A. 

Lemma:  Given  the  L-dimensional  vector  x,  the  M-dimensional  vector  y,  appropriately 
sized  nonsingular  covariance  matrices  S  and  P,  and  the  M  x  L  matrix  F,  the  product 
function 

.¥*,*)  =  V(y;Fx,S)V(x;ftP) 

can  be  refactored  as 

V(y,  x)  =  A/*(y;  u,  fl)  A7(x;  A,  A) , 

where  the  means  and  covariances  in  the  resulting  product  densities  are 

u  =  Fp 
ft  =  S  +  FPFt 
A  =  (I  —  HF)  p  +  Hy 
A  =  (I-HF)P, 


(55) 


(56) 


(57) 

(58) 

(59) 

(60) 
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with  the  supporting  variable 

H  =  PFTfi-‘.  (61) 

The  parameters  can  also  be  expressed  in  information  terms,  where  the  information 
matrix  is  the  inverse  of  the  covariance  matrix  and  the  information  vector  is  the  mean 
vector  preinultiplied  by  the  information  matrix.  In  this  format,  the  density  parameters 
are 


A-‘A  =  P-V  +  FtS-V  (62) 

A-1  =  p-1  +FTS-1F  (63) 

n-‘o>  =  DP“V  (64) 

fi->  =  (I-DFt)S-,  (65) 

with  the  supporting  variable 

D  =  S-‘F(A-1)-‘.  (66) 


Interestingly,  while  this  lemma  arises  as  a  necessary  prerequisite  for  evaluating  -the 
density  recursions  in  the  Baum  algorithm,  it  naturally  generates  all  of  the  Kaiman- 
filter  update  recursions. 

3.2  BAUM  ALGORITHM 

This  subsection  applies  the  results  summarized  in  table  4  to  models  with  the 
Gaussian  densities  in  equations  (45),  (46),  and  (47).  The  derivations  for  the  forward 
and  backward  recursions  proceed  as  an  induction.  The  conditional  state  densities  and 
likelihood  are  then  obtained  in  terms  of  the  forward  and  backward  densities. 

3.2.1  Forward  Densities 

The  assumed  form  for  the  forward  density  at  time  tn_x  is 

®(^n— l)  1  1>  An— l|n— 1)  Pjj— , 

which  includes  the  initial  condition  by  letting  P0|o  =  Po,  /x0|o  =  fj,0,  and  cq  =  1.  The 
first  stage  of  the  forward  recursion  evaluates  the  term 

<5(xn,xn_i)  =  p(xn|xn_i)a(xn_i) 

=  Cn- 1  Affen',  AnXn_i,  Qn)  A Pn-l|n-l)  • 
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Application  of  the  GRL  results  in 

^(XnjXn— l)  =  /^jj|n-l»Pn|n-l)  A/"(xn_i;  An,  Aft)  ,  (67) 

where  the  mean  and  covariance  parameters  in  the  first  factor  correspond  to  a  Kalman- 
filter  time  update  and  are  given  by 


Pn\n—\  —  Aft^/ft_1|n_1 

(68) 

P«|n-1  =  Qn  +  AftPn_2|n-lA^, 

(69) 

respectively.  The  parameters  in  the  second  factor  are 

—  (I  ~  HjjAft)  fln—ifn—l  +  HnXn 

(70) 

■^n  =  (I  “  HftAft)  Pn_i|n_i, 

(71) 

where 

=  Pn-lln-lA^P^^. 

(72) 

These  variables  are  usually  ignored  in  a  Kalman-filter  context,  but, 

in  an  HGMM  con- 

text,  they  occur  again  in  the  smoothing  and  parameter-estimation  problems.  While  the 

mean  An  of  the  second  term  is  a  function  of  the  current  state  Xn, 
term  over  x„_i  produces  unity  regardless  of  the  mean;  that  is, 

integration  of  this 

/dXft.i^Xft.^^)  =  1. 

The  forward  density  then  becomes 

a(Xn)  =  P(ZnM  J  dXn-l 

—  Cft-iA^ZftjBftXft.Rft)  A7(x„; Pnjn_ 

,)■ 

Applying  the  GRL  to  this  product  yields 

^(^n)  Cn—  i  A/"(zn,  Zft,  Sft)  J\f  ^XftJ  /J^n'n ,  Pn|n^  , 

(73) 

where  the  parameters  in  the  first  factor  are  the  estimated  measurement  and  its  error 

covariance,  respectively  given  by 

=  Bn/Zft|n_i 

(74) 

S„  =  Rft  +  BnPftift.iB^. 

(75) 
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The  parameters  in  the  second  factor  correspond  to  the  Kalman-filter  measurement  up¬ 
date  and  are 


Pn\n  (I  GnBn)  pn\n-\  +  GnZn 

(76) 

Pn|n  =  (I  GnBn)  P n|n— 1, 

(77) 

where 

Gn  =  Pnin.^S;1. 

(78) 

Defining  the  innovation  vector  as 

=  Zn  Zn  —  Zn  Bn  fJ>n\n—l 

(79) 

and  recursively  defining  the  weighting  constant  as 

* 

c*  =  Cn-i  J\f(un;0,'En) 

(80) 

results  in  equation  (73)  being  rewritten  as 

=  Pn|n)  > 

(81) 

which  matches  the  assumed  form  for  the  previous  time  step,  completing  the  induction. 
3.2.2  Likelihood  Evaluation 


The  likelihood  of  a  measurement  sequence  is  obtained  using  equation  (13),  that 
is,  by  integrating  the  forward  density  at  time  over  all  of  state  space,  giving 

p(ZN)  =  J  dxN  a(xjv)  =  cN 

=  n^K;0»sn),  (82) 

n=l 

which  equals  the  known  likelihood  expression  for  the  Kalman  filter  [44].  This  definition 
for  the  likelihood  can  also  be  applied  to  the  partial  sequence  Zn  to  obtain  p  (Zn)  =  Cn. 
Since  the  forward  density  can  be  decomposed  as 

a(xn)  =  p(Zn,Xn) 

=  2?(Zn)p(xn|Zn),  (83) 

it  follows  that 

P(Xn|Zn)  =  Pn\ni  Pn|n^)  •  (84) 
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The  “joint  forward  density”  similarly  decomposes  as 

^(xn,xn_1)  =  p( 'Zn-i^Xn-i) 


*  P(Zn-1)p(xn|Zn_a)p(xn_1|xn,Z„_1), 

where 


(85) 


3*2.3  Backward  Densities 


(86) 

(87) 


The  assumed  form  for  backward  density  is 

PM  =  4')^(x0;<+I,PS,n+i).: 

where  the  superscript  (r)  indicates  reverse  time.  The  recursion  again  proceeds  in  two 
stages,  with  the  first  stage  evaluating  the  product 


i’M  =  P(Zn\Xn)  0(Xn) 

Application  of  the  GRL  produces 

1»(x.)-ef)Ar(z.:*fr),S('))V(^;M»,pW)i 


(88) 


where  the  mean  and  covariance  of  the  first  factor  are  the  reverse-time  measurement 
estimate  and  its  error  covariance,  given  by 

4"  =  (89) 

2M  =  R„+BI>pM+1B*  (90) 

respectively.  The  parameters  in  the  second  factor,  which  correspond  to  a  reverse- time 
Kalman-filter  measurement  update,  are 


where 


G«  = 


ptt 

■^nln+l 


(91) 

(92) - 


(93) 
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Defining  the  weighting-constant  update 

c?!1=o(')|An|-V(z„;z<*),EW)  '  (94) 

yields 

Here,  the  canceling  terms  [A;'|  and  |A»|  have  been  inserted  to  accommodate  a  necessary 
factor  in  the  second  stage  of  the  recursion,  which  evaluates  the  integral 

£(Xn-l)  =  J dXn  p(xn|xn_i)V’(xn) 

=  J  dxn?7(xn,xn_1),  (96) 

where 

rjfa, Xn-i)  =  |  An  |  A^AnX^,  Qn)  fx{^n,  P«)  .  (97) 

This  product  does  not  immediately  fit  the  form  required  for  application  of  the  GRL.  If 
An  is  invertible,  however,  then 

A/'(xn_i;A^1Xn,A^1QnA^T)  =  |An  |  ^(XnjAnXn-ijQn).  (98) 

The  GRL  can  then  be  used  to  obtain 


hfrtnjXn— i)  =  JV(=h,-i;A;1x„,A-1Q„A;T)  AT(x„;AW,pW 


(99) 


The  parameters  in  the  first  factor  correspond  to  a  reverse-time  Kalman-filter  time  up¬ 
date,  given  by 


=  A~Vn|n 

Pn— l|n  =  K1  (Qn  +  Pjj,)  A((T. 
The  parameters  in  the  second  factor  are 

A»  =  (i-HMA^pM  +HMX„_1 
Af)  =  (i_Hh)A;>)pW, 

where 


(100) 

(101) 


(102) 

(103) 


H?’  =  P<'?,(Q„  +  Pi?„)'1A„. 


(104) 
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Substituting  the  refactored  product  into  equation  (96)  and  integrating  results  in 

/(Xn"l}  =  ;  '  (105) 

The  induction  requires  the  assumed  form  to  fit  the  initial  condition,  but  this  is 
not  the  case  since  0(xN)  =  1  is  not  a  Gaussian  density.  As  noted  in  subsection  2  1  2 

the  recursion  should  really  be  started  at  time  tN.x,  where  it  is  evaluated  as  a  special 
case  using 


^(xjv-i)  -  j  dxjj  M{zn- B„xn,  Ra?)  AT(xn-,  Anxn.u  Qn)  .  (106) 

To  avoid  having  to  consider  this  special  case  in  all  of  the  recursions  dependent  on 
p(xj,  an  approach  is  taken  here  that  parallels  the  development  of  the  two-filter  Kalman 
smoother  [42, 43).  That  is,  an  information  formulation  for  the  reverse-time  filter  is  used 
and  the  terminal-state  parameters  are  defined  as  P$£,  =  0  and  „m+l  = 

This  approach  is  equivalent  to  defining  0(xN)  as  a  Gaussian  “pseudodensity”  whose 
variances  are  infinite  but  whose  value  for  any  argument  is  unity.  The  information 

formulation  of  the  reverse-time  filter  is  reflected  in  table  7,  which  summarizes  the  com- 
putations  involved  in  the  Baum  algorithm. 


3.2.4  Conditional  State  Densities:  Method  I 

The  state  density  7W  is  the  normalised  product  of  *(*,)  and  0(x,),  which, 
by  definition,  gives  a  properly  normalised  density.  When  the  Gaussian 

“(X")  and  /3(x")  m  multiPUed  without  the  scale  constants,  the  product  is  a  properly 
normalised  density  function.  The  scale  constants  ft,  and  <£>  can  therefore  be  ignored 
when  constructing  the  7(xn),  giving 


7W  = 

This  is  a  product  of  Gaussians  in  the  same  variable  with  constant  means  and  covariances 
which  has  the  following  well-known  form: 


where 


T(*»>  = 


(107) 


Pmw = (pj+p^;; y1 

hn,N  =  P^,  (p;t  +  pw;;  pW +1) . 


(108) 

(109) 
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Table  7.  Baum  Algorithm  for  HGMMs 


Forward  Stage  1  Recursion 
(Time  Update) 

Given:  A^-iin-i,Pn-i|n-i 

Mnjn-l  —  An/Xn-l)n-l 

P«jn— 1  =  Qn  +  AnPn-l|n— 1  An 

38  P n— ljn— 1  A^P n|n— 1 

Forward  Stage  2  Recursion 
(Measurement  Update) 
Given:  At»|„_i,P„|„_i 

Vn  =  Zft  —  BnMn|n-rl 
“  Rn  +  BnPn|n-lPn 

Gn  =  Pnjn-lBnSn1 

Mn|n  —  (I  —  GnBn)  Mnjn-1  +  GnZn 

Pn|n  =  (I  GnBn)  P n|n— 1 

Measurement 

Likelihood 

N 

P(z)  =  n^»;o,sn) 

n=l 

Backward  Information 
Variables 

c  _  p(r)-1.,W  r  -pW“l 

S«  —  r»  Pn  ) 

Backward  Stage  1  Recursion 
(Measurement  Update) 
Given:  fn|n+i,rn|n+1 

£n|n  35  £n|n*f  1  BnRn  Zn 

Pn{n  5=1  Pn|n+1  “1"  BnRn  ^Bn 

Backward  Stage  2  Recursion 
(Time  Update) 

Given:  £n|n,rn(n 

£n— 1|»  =  AnQn  (Pn|n  *4*  Qn  )  £n[n 

Pn-ljn  =  Aj  Qn1  ^‘Qn1  (Pn|n  +  Qn1)  Qn1J-^.n 

Conditional  State  Calculation 
(Smoothing) 

Given:  ££n|m  Pn|n>£nln+1*  Pnjn+1 

Pn|N  =  (P^|n+r„|n+l) 

Mn|JV  =  Pn|^r  (^>n|n^nln  +  £n|n+l^ 

Adjacent-State 

Cross  Covariance 

Given:  Pn!N,Hn 

P  n,n— 1|7V  =  P 

The  use  of  the  information  format  for  the  backward  density  parameters  simplifies  these 
calculations  since  the  variables  P^n+1  and  P^n+1  ^n+1  are  generated  by  the  backward 
recursions  and  need  not  be  calculated  from  the  covariance  information. 

As  outlined  above,  the  calculations  involved  in  obtaining  these  densities  via  Baum’s 
forward-backward  algorithm  are  exactly  the  same  calculations  involved  in  the  two-filter 
implementation  of  the  fixed-interval  Kalman  smoother  [42,43].  The  two-filter  smooth¬ 
ing  algorithm  is  therefore  an  implementation  of  the  Baum  algorithm  for  HGMMs. 
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3.2.5  Conditional  Joint  State  Densities 

Characterization  of  the  conditional  state  evolution  for  HGMMs  is  completed  by- 
specifying  the  joint  density  of  time-adjacent  states,  which  is  obtained  from  equation 
(10)  by  substituting  equations  (67)  and  (95).  This  calculation  gives 

7(xn,xn_1)  =  M (xn;  An,n-1,  Pn,n-i)  A/fo-u  An,  A»)  A/^;  $n,  Pji)  ,  (110) 

where  the  means  and  covariances  of  these  factors  are  defined  in  equations  (68),  (69), 
(70),  (71),  (91),  and  (92).  It  is  easy  to  show  that  the  product  of  the  first  and  third 
terms  provides  an  alternative  construction  for  7(x*),  such  that 

7(xn,xn_i)  =  ^(Xn-uAn.A*).  ■  (111) 

The  accuracy  of  this  expression  is  confirmed  by  recalling  that  ^/’(xn;/iB|JV,PBjJV)  = 
p(Xn|Z//)  and  that  Af  (xn-i;  \n,  An)  =  p(xn_i|xn,  Zn_i).  In  the  latter  expression,  the 
conditioning  variable  can  be  changed  from  Zn_:  to  ZN  since  the  information  contained 
in  Zg-1  is  redundant,  given  that  conditioning  on  x^  also  occurs,  so  that 

Af(xn-  i;K,K)  =  -p(x*r.i\xn,Zir)-  (112) 

The  product  in  equation  (111)  is  therefore 

7(xn,xn_1)  =  p(xn|Zj\r)p(xn_i|xn,Zjvr) 

=  P(Xn,Xn-l|Zjv),  (113) 

which  is  the  desired  definition.  Equation  (111)  is  instrumental  for  evaluating  the  aux¬ 
iliary  function  for  HGMMs. 

For  the  computations  actually  performed  during  parameter  estimation,  only  the 
cross-covariance  matrix  for  time-adjacent  states,  denoted  Pn,n-i|jv,  is  required.  An 
expression  for  this  matrix  can  be  obtained  by  evaluating  the  conditional  joint  density 
and  then  extracting  the  cross-covariance  matrix  from  one  of  the  off-diagonal  blocks  of 
the  joint  covariance  matrix.  This  derivation  is  provided  in  appendix  B,  where  it  is 
shown  that  the  density  of  the  2Lxl  joint  random  vector  X[njn_:j  =  [x£,x£_1]T  is  given 


by 

where 

7(xn>xn— l)  —  A/’^X[n)n_1];^[nin_1]|jV,P[n)n_1j|^  , 

(114) 

/^[n,n— l]|iV  — 

f*n\N 

.Pn-l|JV. 

(115) 

T>  Pnl*  P„|j\rH£' 

[n,n—  1]  |  N  tt  t>  -p 

L  Xlnlr  n| jv  ^rn-l\N  . 

(116) 
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The  term  Hn  is  defined  in  equation  (72).  The  upper  off-diagonal  gives  the  adjacent-state 
cross-covariance  matrix  as 

P  n,n— 1|JV  =  Pn|ivHj.  ^  C11^) 

This  expression  is  considerably  simpler  than  the  recursive  definition  given  by  Shumway 
and  Stoffer  [31]. 

3.2.6  Conditional  State  Densities:  Method  II 

The  expression  for  the  conditional  joint  state  density  in  equation  (111)  provides 
an  alternative  approach  to  calculating  the  conditional  state  densities.  In  particular,  the 
conditional  state  density  can  be  obtained  by  (1)  calculating  the  forward  densities  a(x„) 
for  n  =  1, . . . ,  N,  (2)  initializing  7(xjv)  =  a(xN),  and  then  (3)  calculating  7(xn-i)  given 
7(xn)  for  n  =  N, . . . ,  1.  To  realize  this  algorithm,  the  backward  recursion  defined  in 
step  (3)  must  be  derived.  This  derivation  begins  by  noting  that 

7(xn_1)  =  y  dxn7(xn,xn_a) 

=  y*  dx^  Af^/Xn^Pnltf)  A/” (Xf* — i  ?  \ n>  5  (118) 

where  the  expression  for  the  joint  density  given  in  equation  (111)  has  been  substituted. 
For  convenience,  the  definitions  of  Xn  and  An  are  restated  as 

An  =  (I  H^An)  fin— i|n— 1 

An  =  (I  H«An)  P n— ljn— 1> 

where  the  intermediate  variable  H„  is 

a,  =  p„-h»-iAJp;iL1- 

Given  these  variable  definitions,  the  mean  in  the  second  factor  in  equation  (118)  is 
seen  to  have  an  “offset”  of  (I  -  HnAn)  /xn-i|n-i>  This  offset  is  temporarily  removed  by 
defining  the  new  variable 

Yn— 1  =  X71— 1  (I  HnA„)  (11®) 

With  this,  the  product  in  equation  (118)  can  be  rewritten  as 

7(xn_i)  =  J  dXn  fin\N i  Pn|iV^  A/"(yn— 1>  An)  .  (120) 

Application  of  the  GRL  then  gives 

y(xn-i)  -  J  dx„  A/'fXn;  u„,T„)  ■ 


(121) 


The  mean  and  covariance  of  the  new  yn_!  term  are 


Un  —  Hra 

=  An+KnPnW&r, 


(122) 

(123) 


respectively.  While  the  mean  t;n  and  covariance  Tn  of  variable  x„  are  easily  obtainable 
they  are  not  needed  because  this  normal  density  integrates  to  unity  regardless  of  then 
values.  Since  the  y„_i  term  is  independent  of  x*,  the  state  density  becomes 


7(Xn-i)  =  Af(y n_nun,ton) 

J^,r(Xn-i;  Mn-l|^-,  Pn_i|^rj  .  (124). 

The  recursions  for  calculating  the  conditional  mean  and  covariance  of  each  state  from 

those  corresponding  to  the  next  later  state  are  respectively  given  by 

f^n—l\N  =  Un  +  (I-  HnAn)/^_1|n_1 
=  +  Mn-lln-1  — 

~  +  Hr  (jln\N  ~  Mn|n-l)  (125) 


Pn-1\N  =  An  +  I^P^H^ 

—  ^  ~  Hn-A-n) 

—  Pn-l|n-l  ~  HnAnP^j^j  +  HrP^jvHJ 

=  Pn— l|n— 1  ~  HnPn|n_1H^  +  HnPn,^H^ 

~  (Pn|lV  —  Pn|n-l)  Hj.  (126) 

These  expressions  equal  the  mean  and  covariance  from  the  RTS  formulation  of  the 
fixed-interval  Kalman  smoother  [41],  This  joint-density  marginalisation  approach  thus 
provides  a  very  natural  way  of  deriving  the  RTS  smoothing  algorithm. 

3.3  VTTERBI  ALGORITHM 


The  forward  passes  of  the  Viterbi  and  Baum  algorithms  differ  only  in  that  the 
V.terbi  algorithm  maximises  over  the  previous  state  at  each  step  whereas  the  Baum 
gonthm  marginalizes  out  the  previous  state.  For  Gaussian  state  densities,  the  differ¬ 
ence  between  marginalization  and  maximization  is  just  a  scale  factor.  Neglecting  this 
scale  factor,  the  Viterbi  forward  density  functions  are 

=  ^(xn;tln\n,Pn\n)  ,  (127) 

where  ^|n  and  Pn|n  are  defined  in  equations  (76)  and  (77),  respectively. 
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The  backward  pass  of  the  Viterbi  algorithm  begins  by  maximizing  the  forward 
density  function  at  the  terminal  time  step,  which  gives  x#  =  Pn\n  by  inspection.  After 
this  notation  is  adopted  for  the  nth  state  estimate  (i.e.,  Xn  =  jUn|jv)>  the  function  that 
is  maximized  during  the  Viterbi  backward  recursion  can  be  written  as 

p(A&n|Jv|Xn-l) l)  =  Af(t*n\Ni  AnXn-j,  Qn)  A/^Xjj-i;  Pn-lln-1)  Pn-l|n-l)  •  (128) 

Here,  the  term  p(zn|xn  =  fj,n\N)  is  neglected  since  it  is  a  constant  that  does  not  affect 
the  argmax  operation;  Applying  the  GRL  and  again  neglecting  a  constant  term  yields 

Xn-iijv  =  argmax{A/'(xn_x;/in-i|N,  An)  }, 

where  An  is  defined  in  equation  (71)  and 

l^n—l\N  =  Pn-l|n-l  +  Hn  (Mnl.W  ~  Pn|n-l)  •  (129) 

In  this  last  expression,  //n|n-i  —  Anpn-i|n-i  and  is  defined  in  equation  (72).  The 
state  estimate  is  x„_i  =  Pn-i\N,  which  equals  the  smoothed  state  estimate  from  the 
RTS  smoother  [41].  Note,  however,  that  An  is  not  the  covariance  matrix  for  the  state 
estimate.  The  Viterbi  algorithm  is,  by  its  nature,  incapable  of  providing  second-order 
statistical  information.  On  another  note,  the  equivalence  of  the  RTS  and  two-filter 
smoothers  implies  the  equivalence  of  the  most  likely  state  sequence  (the  Viterbi  track) 
and  the  sequence  of  individually  most  likely  states  (the  Baum  estimates). 

3.4  PARAMETER  ESTIMATION 

For  HGMMs,  the  E-step  of  the  EM  algorithm  consists  of  evaluating  the  auxiliary- 
function  components  defined  in  equations  (24), (25),  and  (26),  which  requires  the  calcu¬ 
lation  of  expected  values  for  various  quadratic  functions  under  a  normal  density.  These 
expectations  are  evaluated  using  the  following  identity: 

J  dx.  (xTFx  -I-  xTf  +  /o)  A/"(x;  p,P)  =  tr  |F  (p  +  ppT)  }  +  pTf  +  /o>  (130) 

which  is  a  special  case  of  theorem  10.5.1  in  Graybill  [46].  The  M-step  of  the  EM 
algorithm  consists  of  maximizing  the  auxiliary-function  components  obtained  during 
the  El-step,  which  requires  differentiation  of  those  components.  In  addition  to  standard 
matrix  and  trace  derivatives,  these  calculations  require  the  identity  [47] 

J^tr{FTSxFS2}  =  S1FS2  +  S^FSj.  (131) 
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Auxiliary-function  component  Q0  is  given  by 
Qo  =  E  /  7(xo)  log  p(x}) 

=  £ /d*SM'(*o;Poiir.,Piin.)  log ^(x‘;fi0,Po) . 


(132) 


Expanding  the  logarithm  term,  neglecting  the  constant  -Llog(27r)/2,  and  neglecting  a 
scale  factor  of  1/2  results  in  . 


~  Xj/dxo  ^(xo^iJVfc.Poiiv*)  {log ) P0 1 1  ~  (x^-^0)TP01  (xj-^o)j.  (133) 
Performing  the  integration  using  equation  (130)  gives 


Qo  AT  log  |  P0 1 1  J2  tr  {po 1  PoV*  +  (^o|ivfc  -  Mo)  (moi Nk  “  Mo)TJ }  •  (134) 


This  function  is  concave  in  the  parameters  (i0  and  Pq\  so  that  the  optimal  parameter 
estimates  occur  at  the  unique  critical  point.  The  derivative  of  Q0  with  respect  to  no  is 
obtained  as 

=  ~djr0  S.tr  {p°_1  " Mo)  ~  ,o)T} 

K  d  T 

=  ~  S  ^Mo  _  ^°)  Po’1  Mat*  -  Mo) 

K 

~  -E2P01  (135) 

Equating  this  derivative  to  zero  and  solving  for  ji$  gives 

l4+1  =  (136) 

Li  general,  a  symmetry  constraint  must  be  imposed  when  mayim^ino;  Q0  t0  find 
the  optimal  P0.  This  constraint  can  be  implemented  by  using  derivative  formulas  that 
explicitly  take  into  account  the  symmetry  of  the  matrix  or  by  performing  a  constrained 
optimization  in  which  the  appropriate  Lagrangian  is  maximized.  The  constraint  turns 
out  to  be  redundant  for  the  HGMM  covariance  matrices,  however,  because  the  uncon¬ 
strained  optimizer  has  a  symmetric  form.  This  redundancy  occurs  when  optimizing  Qft 
and  as  well.  The  derivative  expressions  used  to  find  the  optimal  covariance  estimates 
axe  therefore  given  for  matrices  with  independent  elements,  which  greatly  simplifies  the 
analysis. 
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The  derivative  of  equation  (134)  with  respect  to  Pq  1  is 

^x  =  —(Ml)  (t£\Nk  Tt*o)  }  •  (137) 

osr0  fc=  1.  '  / 

Equating  to  zero,  substituting  the  optimal  value  of  no,  and  defining 

4  =  Moijv*  —  4+1  (138) 


gives  the  parameter  update 

po+1  =  iE{Pk+44T}. 


(139) 


The  transition-density  component  Qx  was  defined  in  equation  (25)  as 

ffimax  Kn  f  t 

Qx  =  £  £  / I  7(3^, x£_x)  logp(x£|x£_x) . 

n=l  fc=l  J  J 

Substituting  the  definition  of  p(x£|x£_x),  neglecting  the  constant  Llog(27r)/2  and  the 
scale  factor  1/2,  and  marginalizing  7(x£,x£_x)  from  the  log  I Q"1 1  term  yields 


Nzazx  Kn 

Qx  =  53  £  {lo§  I  Q^1  I  -  A(An,  Qn)}  , 

71=1  fc=l 


(140) 


where 


.  4( An,  Qn)  =  Jd^JdX^-l 

x  {  (x£  “  AnX^_x)T  Q;1  (4  -  Anx£_x)  } .  (141) 

The  integral  term  can  be  further  decomposed  as 

4(A„,Q„)  =  4<1,(Q„)+4<!>(A„,Q„)  +  4(S)(A„,Qn)-  (142) 

The  first  component  is  given  by 

4<l>  = 

=  /<&£  i-(x*)**T  CK‘4 

“  tr  {Qn1  (PnjNfc  +  }  •  (143) 


39 


The  second  component  is 

4(2)  =  -2fd4fd4_  i7(x;>4.1){4tQ;1a„xJ.1} 

=  -2 /**/*£*#(£  *(<*&*!) 

-2  fdx£  |W4,Pj|„J  xJT  Q;1  A„  J dxJ.j  Ar(xJ_i;A‘,  Aj)  xJ.j 

=  -2/^^^|w„P^,)xStQ;Ia„^ 

-2/lfi4-V(xS;^, Pj|WJ  xJT Q;1  A„  { (i  —  A„) /4-ijn-i  +H£x£j 

—  _2 /<&£ -V(xJ;;*4  iw».Pj|j»,) 

x  {x„  Q^AjaJxJ  +  xf'<£1An(l-HjA,)p$_1|„_,}.  (144) 

Applying  the  identity  in  equation  (130)  then  gives 

4P>  =  ^trfQ^AnH^  (PR|jvt  +/4|WJll4jAf*)}  ~  (l  —  H^A„)  Pn-lfa  1 

=  -2tr  |Q“lA„  (h;  P^)}  -  2^-A.  +  Hj  -  ^.l)} 

"  -Str{CC*A.(H*i>J|lll)}-2rfJkQ 


'  •  \  ’i  K/  j  *  — /<- r 

=  -2  tr  {q^A*  (P*  n_1|J/+  /4|j\r*  jUn-i|ivfc)T}  • 


The  last  component  is  given  by 

4(3>(An,  Qra)  =  /  /  *4-1  7(4i  *t.l)  {*£-lA*  Q-l  A^} 

—  f  dXn-2  'y(X^-l)x^-iA^Qn1  AnX^_j 

~  tr  An  (^n-ll/4  +  An-l|iVfc  ^n-l|AT*)  } 

tr  {Qn  1  A„  (PLi|^  +  /4-ii.v*  /£*1|JVfc)  Aj} .  (146) 

When  these  integral  terms  are  substituted  back  into  the  expression  for  Qx,  each  trace 
term  imdergoes  a  summation  over  k.  Since  the  parameters  An  and  Qn  axe  constant 
across  k ,  the  summations  can  be  taken  inside  the  trace  and  applied  to  the  terms  involving 
the  state  means  and  covariances.  It  is  therefore  convenient  to  define 


Kn 

CXnx„.  —  {^n|JV*  +  /4|7\rfc/4|.Ar*} 

fc=l 

Kn 

=  E  {pi,n-iiw, + n5p»»^iiw,}  • 


(147) 

(148) 
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With  these  in  hand,  Qx  can  be  written  as 

QX  —  E  j^nlog  Qn  |  “  trjOn1  2AnC^nJ(B_J  + A„CXb_1Xb_1AJ)|  | 

=  E,kios|Q;1|-teK1c-x.}+»{Q;1A„c^.l} 

71=1  V  J 

+  ^r{Qn1Cxnx„_iA^}  —  tr  |Qn1AnCXn_lXn_1Aj|  (149) 

This  last  expression,  which  makes  use  of  the  trace  properties,  is  used  for  optimization 
of  the  covariance  matrix  because  it  ensures  symmetry  in  cases  when  An  is  known  [48]. 
The  derivative  of  Qx  with  respect  to  An  is 

dAn  ~  ~  dAn  (^'X"Xn  ~  ^  ^^XnXn-i  +  AnCXn_lXn_1A^| 

=  2(cX»,.,q;i)t-2Q;1a.,cx,i„_1 

=  2  Qn  ^CXnXn_1  —  AnCXnXn_1  j  .  (150) 

Equating  this  derivative  to  zero  and  solving  for  An  yields  the  update 


■^n+1  —  (151) 

The  derivative  of  Qx  with  respect  to  Q"1  is 
8Qx 

$q-i  KnQn  ~  CXnXn  +  CXjlXn_1An  4-  A„CXnXn_l  —  AnCXn_lXn_1A^,  (152) 

which,  when  equated  to  zero,  gives  the  estimator 

^n+1  =  Jq  {^x»xn  ~  CXnXn_1A^  -  AnCXnXn_l  +  AnCXn_lXn_1Aj| .  (153) 

This  expression  is  used  if  An  is  known.  Otherwise,  the  optimal  value  for  An  is  substi¬ 
tuted  to  obtain  the  update 

<*^l+1  =  ~K^  “  cxnxn_1CXn1_lXn_iC^nXn_i } .  (154) 

Finally,  the  auxihary-function  component  Qz  is 


Nxnax  Kn 


*  'max  “rt  ■ 

Qz  *  £  £  /  d4  7(4)  log  P(4l4) 

n= 1  fc= 

Nm  ax  Kn 

=  E  E  {‘°s 

_ 1 _ i  1  * 


71—1  fc— 1 


(155) 
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where  the  integral  term  is 

A (Bn,  Rn)  =  JdxiM^ii  ^Nk,Pk„tNk)  {(z*-B„:>4)TR;1(zJ-B„xJ)} 

=  Jd*iX{<-,rilKk,Pkn  |%) 

x  {4TBn‘4-2xJTBjR;1zS  +  xJTBjR;1B„4} 

=  zn  Bn1Z*  —  2/4|WfcBnRn1Zn  +  tr  {BjR^Bn  (p*]w*  +  PnlNk  A*n]W*)} 

=  to  {r;1  {z‘ zf  -  2z£#4j,„Bj  +  B„  (PS|N,  +  ^|W,  #4^)  Bl}  ]• 

(166) 


After  bringing  the  summation  over  k  into  the  trace  terms  and  defining 

CznZn  —  53  Zn  2n  (157) 

fc=l 

CZnXn  =  Znf1n\Nky  (158) 

k= i 

Qz  can  be  expressed  as 


Qz  — 


-ATmax  ( 

E  \K« >0? |  K'  I  -  tr  {B;1  (CU -  2 B„C +  BnCn.n.B J' 

«= 1  l 

iVnax  r 

E  AnloglJC1!-  tr{K;1Cw.}+tr{H;1B.C£x ,} 

n=  1  l. 

+  ^r  {l^n1CZnXn®n}  ~  {^n1®nCXnXrlB^| 


(159) 


Because  this  expression  has  the  same  structure  as  Qx,  the  optimal  updates  are  found 
at  the  critical  point  using  the  same  steps  as  for  Qx,  with  the  following  result: 

K+l  =  CfcXnC^  (160) 

Ri+1  =  {Cm.  -  C„n.Bj  -  B„CE,  +  BnC^Bj}  (161) 

=  {C  -  C^C-^C^} .  (162) 

The  correlation  matrix  and  model  parameter  estimates  are  shown  in  table  8.  In  this 
table,  the  factors  1/Kn  on  the  correlation  matrices  are  borrowed  from  QJ+1  and  R4+1- 
The  factors  cancel  in  A4+1  and  Bj*1. 
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Table  8.  EM  Parameter  Estimators  for  HGMMs 


Correlation 

Matrices 

1  Kn  (  T  1 

C*n»n  =  ^2  {P»IW<=  +  ^1** 

Kn 

Cxnxn_!  =  ^2  |Pn,n-l!Wfc  +  An| Nk 

,  Kn 

n  1  V'  _*  _fcT 

Wn*n  —  ~rp~  7,  zn  zn 

1  k  fcT 

CZnxn  —  ~t7~  /  J  znV“n\Nk 

71  k=l 

System 

Matrices 

A^+1  = 

pt+1  __  p  p“l 

“  WnXn^XnXn 

pi+1  _  p  p  p-1  pT 

Hn  -  ^xnxn  -^XnXft-i^^Xn-i^XnXn-i 

■pi+1  ^  p  p-i  p-1  pT 

•tt-n  —  "  ^nXn^XnXn^ZflXn 

Initial-State 

Parameters 

/4+1  = 

1 

£o  -  Mo|Jvfc  “  Mo+1 

**o+i -jy:  {poiArfc + £§ } 

fc=l 

The  auxiliary-function  components  Qo,  Qx,  and  Qz  are  concave  in  parameter  sets 
|(u0,Po1},  {An,  Q”1},  and  {Bn,!^1},  respectively.  The  updates  at  each  iteration  are 
therefore  the  unique  maxima  of  the  CDLF.  The  final  measurement  likelihood  p(Zjv) 
is  not  necessarily  concave,  however,  so  that  suboptimal  local  maxima  are  possible. 
Multiple  training  runs  from  different  starting  points  may  therefore  be  needed  to  find 
the  global  maximum. 
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3.5  TIME-DWARIANT  MODELS 

The  parameter  update  formulas  given  above  are  easily  specialized  to  time-invariant 
HGMMs  by  performing  a  second  averaging  operation  across  time  when  the  correlation 
matrices  are  calculated  in  equations  (147),  (148),  (157),  and  (158).  These  time-averaged 
correlation  matrices  are  then  used  in  equations  (151),  (154),  (160),  and  (162),  which 
are  each  evaluated  only  once  for  all  time. 

The  measurement  likelihood  in  the  time-invariant  case  has  a  parameter  invari¬ 
ance  structure  that  is  worth  noting.  Specifically,  as  an  argument  of  the  measurement 
likelihood  function  in  equation  (82),  the  parameter  set 


©  - '  {A,  B,  Q,  R,  fx0,  P0}  (163) 

is  equivalent  to  any  set 

®  =  {A,Bj  Q,R, /2ojPo|  >  (164) 

where  the  transformed  parameters  are  given  by 

A  =  UaAU^1  (165) 

_B  =  BU0-1UI1  (1 66) 

Q  =  U^UoQUjU^  (167) 

fa  =  U^Uo/io  (168) 

Po  =  LUUoPoUjU^.  (169) 


Here,  is  any  nonsingular  Lx  L  matrix,  and  U0  is  any  nonsingular  Lx  L  matrix 
that  commutes  with  A. 

The  equivalence  of  the  two  models  under  the  likelihood  function  is  demonstrated 
in  appendix  C.  Two  conclusions  can  be  drawn  concerning  this  invariance.  First,  the 
EM  algorithm  will  converge  to  the  member  of  the  invariant  family  that  is  closest  to  the 
starting  point.  Second,  any  member  of  the  invariant  family  is  theoretically  as  good  as 
any  other  for  classification.  While  the  second  conclusion  suggests  that  there  is  no  need 
to  be  concerned  about  the  first,  it  may  be  desirable  for  numerical  reasons  to  constrain 
the  EM  algorithm  to  produce  estimates  of  a  given  structure.  For  example,  it  might  be 
beneficial  for  the  transition  matrix  to  be  as  close  as  possible  to  an  identity  matrix. 
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3.6  MIXED-MODE  MODELS 

Attention  is  now  turned  to  mixed-mode  HGMMs  (MM-HGMMs).  The  Baum 
forward  densities  for  these  models  axe  obtained  directly  by  substituting  an  indexed 
version  of  equation  (81)  into  equation  (29),  giving 

«(*»)  =  E^-4^(xn;/4|n?Pi|n)-  (l70) 

3= 1 

The  c£,  /4|n>  P„|n  axe  calculated  for  each  j  using  the  HGMM  recursions  with 
the  single-mode  prior  p(xo|j).  Furthermore,  since  the  integral  of  the  sum  of  terminal 
densities  is  just  the  sum  of  the  integrals  of  the  individual  densities,  the  likelihood  can 
be  written  as 

P(z^)  =  EPiKZiV|i)  =  EPi4.  (i7i) 

i=i  i= i 

Drawing  on  equations  (107)  and  (34),  the  conditional  state  densities  axe 

7(Xn)  =  (172) 

3-  1 

where  fjc^N  is  the  conditional  state  mean  and  is  the  state  covariance  matrix  from 
the  appropriate  single-mode  HGMM. 

Parameter  estimates  in  MM-HGMMs  are  obtained  using  results  from  section  2  and 
the  analysis  techniques  of  the  previous  subsection.  The  mean  and  covariance  updates 
for  the  mixture  components  in  the  initial-state  prior  distribution  axe  given  by 


tis  ~  K  Eftm/4ik 

K3  fe=l 

(173) 

Ki  k= l 

(174) 

where 

K 

k3  ~  E  Pj\Nk  • 

fc=i 

(175) 

The  variable  e^,t+1  denotes  the  difference  between  the  estimate  of  the  mean  at  time  to, 

conditioned  on  the  kth  measurement  sequence,  and  the  weighted  sum 
from  all  measurement  sequences;  that  is, 

of  the  estimates 

-jfe.i+l  _  ,Jk  _  i+ 1 
*0  —  Mo| Nk  • 

(176) 
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The  estimates  for  the  mixing  parameters  were  defined  in  section  2.  The  expressions  for 
the  system-matrix  estimators  in  terms  of  correlation  matrices  are  identical  to  those  given 
in  equations  (151),  (154),  (160),  and  (162)  for  single-mode  HGMMs.  The  correlation 
matnces  for  MM-HGMMs  are  very  similar  to  those  for  single-mode  HGMMs,  but  with 
a  weighted  sum  over  the  mode  assignments;  that  is, 

j 

CXnXn  =  X/bW  CLxn  (177) 

j 

Cxnxn_!  =  EteCL.!  (178) 

3= 1  ' 

J 

CZnXn  =  JCPjliV*  CinXn,  (179) 

where  C£nXn,  CinXn_i,  and  C£nXn  are  given  by  equations  (147),  (148),  and  (158),  respec¬ 
tively,  except  that  the  fi*{Nk  and  are  indexed  by  j.  The  measurement  correlation 
matrix  is  identical  to  that  given  in  equation  (157)  since  the  measurements  do  not  depend 
on  the  mode  index.  • 

While  the  number  of  components,  J,  in  the  mixture  density  has  been  assumed 
to  be  known  thus  far,  it  might  be  estimated  as  follows.  First,  the  system  matrices  for 
a  single-mode  HGMM  with  a  fixed  large-variance  Gaussian  prior  are  estimated  using 
the  measurement  training  sequences.  Second,  the  estimated  parameters  are  used  to 
calculate  the  conditional  state  sequence  corresponding  to  each  measurement  sequence. 
Finally,  a  multivariate  clustering  algorithm  is  applied  to  the  initial  states  from  these 
state-sequence  estimates.  The  number  of  significant  clusters  provides  an  estimate  for 
J,  and  the  location  and  spread  of  the  clusters  provide  initial  estimates  of  and  P0j, 
respectively,  for  each  mixture  component. 
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4.  SUMMARY 

A  general  theory  of  continuous-state  hidden  Markov  models  (CS-HMMs)  has  been 
presented.  The  given  results  solve  the  likelihood  evaluation,  state  estimation,  and  pa¬ 
rameter  estimation  problems,  to  the  extent  that  the  solutions  can  be  formulated  inde¬ 
pendent  of  the  particular  form  of  the  model  densities.  The  CS-HMM  results  were  then 
specialized  to  linear  Gaussian  models,  resulting  in  the  hidden  Gauss-Markov  model 
(HGMM).  A  Gaussian  refactorization  lemma  has  been  derived,  which  provides  a  nec¬ 
essary  tool  for  evaluation  of  the  Baum  recursions  for  HGMMs  and,  at  the  same  time, 
naturally  generates  the  update  recursions  for  Kalman  filters  and  smoothers.  The  Baum 
and  Viterbi  algorithms  for  HGMMs  were  shown  to  be  equivalent  to  two  different  im¬ 
plementations  of  the  fixed-interval  Kalman-smoother.  It  was  shown  that  the  likelihood 
obtained  using  the  Baum  algorithm  for  HGMMs  equals  the  classical  likelihood  definition 
from  Kalman-filter  theory  and  that  the  parameter  estimation  algorithm  for  these  mod¬ 
els  is  equivalent  to  the  existing  expectation-maximization  (EM)  algorithm  for  Kalman- 
filter  models.  Taken  together,  these  results  unify  Kalman-filter  and  HMM  theory.  The 
parameter-estimation  algorithms  given  in  this  report  were  formulated  for  multiple  train¬ 
ing  sequences  with  unequal  lengths.  The  HGMM  training  algorithm  presented  here 
therefore  extends  previous  algorithms  that  treat  equal-length  training  measurements. 
This  analysis  has  also  resulted  in  a  new  estimator  for  the  cross  covariance  between  adja¬ 
cent  HGMM  states  that  is  considerably  simpler  than  existing  estimators.  A  parameter 
invariance  structure  was.  demonstrated  for  HGMMs  whose  parameters  do  not  .vary  with 
time..  Finally,  the  CS-HMM  and  HGMM  algorithms  were  extended  for  models  whose 
initial  state  is  governed  by  mixtures  of  densities  instead  of  a  single  density. 

This  work  paves  the  way  for  extensions  of  HMM  and  Kalman-filter  algorithms  in 
both  classification  and  tracking  applications.  For  example,  it  generates  a  framework  for 
investigating  analogs  of  Kalman  filters  and  smoothers  for  non-Gaussian  model  densities. 
In  addition,  the  parameter-estimation  algorithm  could  be  used  to  obtain  a  more  sophis¬ 
ticated,  and  possibly  more  accurate,  tracking  algorithm  whose  state  and  measurement 
covariance  matrices  are  influenced  by  the  observed  data  instead  of  being  predetermined 
from  the  prior  estimates  of  the  model  parameters. 
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APPENDIX  A 

PROOF  OF  THE  REFACTORIZATION  LEMMA 

This  appendix  demonstrates  the  equivalence  of  the  two  product  functions 

'  77(y,x)  =  M (y;  Fx,  S)  A/^x;  //,  P)  (180) 

=  W(y;w,0).Af(x;A,A),-  .  (181) 

where  u,  f2,  A,  and  A  are  defined  in  terms  of  F,  S,  \i,  and  P,  as  given  in  section  3.1. 
Essentially,  this  refactorization  is  equivalent  to  saying  that  p(y|x)p(x)  =  p(x|y)p(y). 
Beginning  with  equation  (180)  and  substituting  the  normal  densities  gives 

*7(y»x)  =  (27r)“(i+M)/2|s|"1/2|p|"1/2exp|-|^(y,x)J.  (182) 

A  useful  form  for  the  exponent  £(y,x)  is  obtained  as 

f(y>x)  =  (y  —  Fx)t  S-1  (y  —  Fx)  +  (x  —  fx)r  P-1  (x  —  fx) 

=  xT  (P"1  +  FTS-1F)  x  -  2xt  (P-V  +  E^S-V)  +  yT  S'1  y  +  f  P"1 » 

=  xTA_1x-2xTA-1A  +  yTS-1y +  /zTP“V 

=  (x- A)TA_1(x- A) -ATA_1A  +  yTS~1y  +  ^TP-1iu.  (183) 

The  variables  A  and  A,  which  are  introduced  in  the  third  step  of  equation  (183)  and 
axe  designed  to  facilitate  completing  the  square  in  the  last  step,  are  defined  by 

A'1  =  P-1  +  FtS~1F  (184) 

A-1  A  =  P-V  +  FTS-Xy.  (185) 

These  variables  naturally  arise  in  the  “information”  form,  where  the  inverse  covariance 
matrix  is  the  primary  variable  instead  of  the  covariance  matrix  and  the  mean  is  scaled 
by  the  inverse  covariance.  The  variable  defined  in  equation  (184)  is  called  the  informa¬ 
tion  matrix,. and  the  variable  defined  in  equation  (185)  is  called  the  information  vector. 
Equations  (184)  and  (185)  constitute  a  measurement  update  in  the  information  formu¬ 
lation  of  a  Kalman  filter  [5].  If  the  covariance  matrix  is  desired,  the  matrix  inversion 
lemma  (MIL)  [46,49]  can  be  applied  to  obtain 

A  =  P-PFt(S  +  FPFt)_1FP.  (186) 


49 


Defining  the  variables 


s  =  s+fpft' 

H  =  PF1^"1 

results  in  the  covariance  matrix  becoming 

A  =  P-PFtE"1FP 
=  (I-HF)P. 

The  mean  vector  is  then  obtained  as 

A  =  A(P~V  +  FTS-1y) 

=  (I  —  HF)  /j  +  (I  —  hf)  PFTS-1y 
=  (I  -  HF)  fi  +  (i  -  PFtE-1f)  PF^^y 
=  (I  ~  HF)  fi  +  PFt  (i  -  E-1FPFT)  S-1y 
=  (I  -  HF)  fi  +  (PFTE-1)  (e  -  FPFt)  S-1y 
=  (I  —  HF)  /i  +  HSS_1y 
=  (I  —  HF)  fi  +  Hy. 


(187) 

(188) 


(189) 


(190) 


Returning  to  the  exponent  term  in  equation  (183)  and  introducing  the  functional 

C(y)  =  yTS-1y  +  /iTP'V-ATA-1A  (i9i) 

causes  the  exponent  to  become 

£(y,x)  =  (x  “  A)T  A-1  (x  -  A)  +  C(y),  (192) 

which  is  put  in  a  form  similar  to  equation  (183)  as 

C(y)  =  yT  S-1  y  +  P'1  /t  -  (p- V  +  PTS-1y)  T  A  (p- V  +  F^-'y) 

=  yT(s-1-S-1FAFTS-1)y-2yTS-1FAP-V  +  ^(p-t_p-iAp-.N 
=  yT  n-1  y  -  2yTn-^  +  (p-  _  p-1  AP-1)  A  ’ 

=  (y-u-)T£l-‘(y_1J)-wTn-‘a,  +  ^(p-t-p-tAp-l)M.  (193) 

The  variables  u>  and  ft  introduced  above  are  defined  by 

f2~i  =  S"1  -  S-1FAFtS-1  (194) 

ft-1*  =  S-’FAP-V  .  (195) 


These  terms  can  be  simplified  by  introducing  the  variable 


D  =  S-1FA 


=  S-F(p-‘  +  F^S->F)-‘, 


with  which  the  information  matrix  becomes 


n-1  =  (i  -  s-1faft)  s-1 

=  (i-DF^S"1 


(197) 


and  the  information  vector  becomes 


=  DP  V .  (198) 

Equation  (194)  can  also  be  converted  to  covariance  form  by  applying  the  MIL  in  reverse; 
that  is. 


n  =  |s-1-s-1f(p-1  +  fts-1f)'1fts-1|  1 
=  s  +  fpft  =  s. 


(199) 


Noting  that  O  -  E  and  applying  the  MIL  allows  the  mean  vector  to  be  obtained  as 

■  w  =  os-1f(p-1+fts-1f)“1p“V 

=  ES_1F  (P  -  PFtS_1FP)  P"V 
=  ES-1  (i  -  FPFtE-1)  fpp~v 
=  SS"1  (E  -  FPFT)  E"^ 

=  SS-1SE_1F^ 


= 

The  functional  £(y)  is  now  written  as 


C(y)  =  (y-w)Tfi_1(y-u;)  +  Ac} 


where 


(200) 


(201) 


=  /iT  (P-1  —  P-1AP-1)  fj,  —  u/T  f2-1  u 
=  f  (P_1  -  P-1AP~1  -  FTE-1F)  h 
=  HT  {p-1  -  p-1  (p  -  pfte_1fp)  p-1  -  FtE“1f}  , 
=  /1T  (p_1  -  p-1  +  FTE"1F  -  FtE"1f)  ji 
=  0. 


(202) 
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The  combined  exponent  from  equation  (180)  therefore  becomes 

£(y,x)  =  (y  —  cj)Tfl~1  (y  -  w)  +  (x  ^  A)T  A-1  (x  —  A). 

Substituting  this  expression  for  f(y,x)  into  equation  (182)  gives 

>7(y,x)  =  (2.)«|s|-v>rexp|_l(y_ij)TfJ.1(y_w)} 

x  exp|--(x-A)TA-'(x-A)}.  (203) 

The  exponentials  can  be  written  as  normal  densities  if  constants  are  included  to  com- 
pensate  for  the  normalizing  constants;  that  is, 

b(y,x)  =  (2x)-(^|s|-I/>r1%)«/=|fi|VV(y;W,n) 

x(2^|A|1/V(x;A,A).  (204) 

Multiplying  out  the  2ir  terms  and  collecting  the  determinant  terms  gives 

*?(y,x)  =  cj\f(y;u>,  ft)  M (x;  A,  A) ,  (205) 


where 


C  ~  {lSl  '|P|  1  *  |«|  *  |A|}1/2.  (206) 

Equation  (205)  is  the  desired  expression  if  it  can  be  shown  that  c  =  1.  This  equality 

is  demonstrated  by  using  a  theorem  for  block  matrices  [46,49],  which  states  that  the 
determinant  of  the  matrix 

t>  _  fBn  B12 

~  [b»  B»J  (207) 

with  nonsingular  Bu  and  B22  is  given  by 


(207) 


lBl  - 

• 

B22  —  B21B1-11B12 

= 

B22 

* 

B11  -  B12BJ2  B2i 

First,  the  definitions  of  ft  and  A  and  equation  (209)  axe  noted  to  obtain 

f  ft  |  *  I A I  =  E  M  P  -  PFtE-1FP  I  =  det  [  P  PFT 
11  1  FP  E  ' 


I  ft  I  *  |  A  |  =  |  S  |  •  I P  -  PFtE-1FP  |  =  det 
It  is  also  noted  that 

Ip-1  Ms-1!  =  det[p_1  0 

0  S"1 


(208) 

(209) 


(210) 
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(211) 


(212) 


Given  these  two  partitioned  matrices,  the  desired  product  is 

s  |-1  - 1  p  |-1  •  |  ft  |  •  |  a  j  =  j  r  |, 

where 

jP"1  °][p  PFTj  _  T  I  FT  ‘ 

~  [  0  S"1]  [FP  S  J  “  S-1FP  S_1E_ 

Application  of  equation  (208)  then  yields 

r|  =  1 1  -  j  s-1s  —  s-1fpift  | 

=  s-1(s-fpft)| 

=  S^S  1  =  1, 

which  completes  the  proof. 


(213) 


(214) 
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APPENDIX  B 

HGMM  JOINT  STATE  DENSITY 

This  appendix  derives  the  conditional  joint  density  of  time-adjacent  HGMM  states, 
as  given  in  equations  (114)  through  (116).  The  starting  point  for  this  derivation  is 
equation  (111),  which  states  that 

l)  =  /^n|iVj  ./V(Xn— lj  An,  An)  ,  (215) 

\ 

where 

^ n  =  (I  HnAft)  /in— 1 |n— 1  4*  HnXn 
A n  =  (I  HnAn)  Pn— 1|ti— 1 

It  =  Pn-^KKfn-V 

Substituting  the  functional  forms  of  the  normal  densities  gives 

7(xn,xn_1)  =  (27r)-i  |  Pn|7^  |  1/2|a„|  ^expl-I^XnjXn-Oj,  (216) 

where 

£(^n?Xri“l)  =  Hn\N^  Pn|JV  fbt\N )  4*  (Xn_  \  An)  An  (xn_i  An).  (217) 

Defining  the  conditional  state  error  vector  en  =  Xn  —  and  recalling  that 

f^n\n—l  =  An  /in— 1  |n— 1 
Mn— 1| N  =  Mn—l|n— 1  4  Hn  (^/in|iV  /in|n— l) 

allows  An  to  be  rewritten  as 

A^n— l|n— 1  4  Hn  ^Xn  An  /^n— l|n— l) 

=  Mn— l|n— 1  4  Hn  (/4i|JV“  Mnjn— 1^  4  Hn  Sn 

=  /4i— 1|N  4  Hn  6n  .  (218) 

Defining  the  error  vector 

®n— 1  =  Xn— l  An  —  £n— 1  Hn  En  (219) 

then  results  in  £(xn,Xn-i)  being  expressed  as 

£(Xto5  Xn— 3.)  =  SnTPn|^n4en-iTA’ien.i 

“  Pn|iV  4  (£n— 1  Hn  En)  A  (sn— 1  Hn  En) 

=  e»T  (p;‘w  +  Hi A-*H„)  -  ej., A-'H„  £„ 

-£lHjA;1£„.1  +  e„.,TA-1£„.,.  (220) 
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Equation  (220)  is  written  in  matrix  notation  as 


^(Xn,  Xn— j)  = 

‘  £n  ' 

T 

'p;ik + hja-‘h„ 

-HlK1' 

’  £n  ' 

.  £n-l . 

1 

£ 

b~ 

F 

K1  . 

.£n-l. 

The  joint  density  can  then  be  written  as 


7(Xn,X«-l)  —  ^V’(x[n)n-l];^[n,n-l]|//,P[n,n-l]|//)  , 
where  the  joint  state  variable  is 

:  x[n,n-l]  = 

and  the  mean  and  covariance  are 


Xn- 

Xn-l 


1]|JV 


f^n\N 


P[n,7t— 1]|J\T 


p-i 


nl«  +  -HJA;1 

-AJ'H.  A;1 


-1 


(221) 

(222) 

(223) 

(224) 

(225) 


respectively.  There  axe  no  scale  factors  in  equation  (222)  because,  using  an  argument 
similar  to  that  given  in  appendix  A,  it  can  be  shown  that 


Pn|JV  '  An 


p-i 

lJ|iV 


(226) 


A  more  useful  expression  for  the  joint  covariance  matrix  is  obtained  by  applying 
the  block  matrix  inversion  theorem  [49].  This  theorem  states  that  the  inverse  of  block 
matrix  S  with  nonsingular  diagonal  blocks  Su  and  S22  is 


where 


(227) 


Tn  —  Sn*2 
^12  =  — S111Si2S22'[i 

t2i  =  -s2-21s21sr112 

T22  =  S22[i  ■ 

Here,  Su.2  and  S22-i  are  the  Schur  complements 

Su.2  =  Sn  —  S21SJ21S12 
Saw  =  S22  —  Si2Sj"11S2x. 


(228) 

(229) 

(230) 

(231) 


(232) 
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The  structure  of  the  joint  covariance  matrix  dictates  that  the  diagonal  blocks  should 
equal  the  covariance  matrices  Pn|w  and  Pn-i|N-  This  outcome  is  confirmed  by  substi- 
tuting  Sn  =  -t-HjA^H,,  Su  =  -HjA’1,  S21  =  A^H„,  and  Sk  =  A;\  so 
that  the  Schur  complements  are 

Su-2  =  (P;,V  +  - HJA-'AA-'H,)"1  =  P;>,  (233) 

Saw  =  A-1-A-1(P„,„+HjA-1H„)"JHjA-1  =  (A  +  HUP^H*)-1 .  (234) 

With  these  results,  the  first  diagonal  block  of  the  joint  covariance  matrix  is  Tn  =  Pn|Ar, 
as  it  should  be.  The  second  diagonal  block  is 

Tk  =  A  +  HJPniwHj 

=  (I  HnAn)  Pn-i|n-l  H“  HnPn|jvHn 

—  P-  n-l|n-l  —  HnAnPn-i\n-i  +  HnPn|^H^ 

”  Pn-l|n-l  “  HnPn|n-lPn|n-i AnP 1|»-1  +  HnP 

■=  P,-n„-i-H.,P„l„-iHj  +  H..Pn|wHj 

=  Pn— l|n— 1  —  Hn  ^Pn|AT  Pn|n— l)  Hj,  (235) 

which  is  just  the  definition  of  Pn_i|jv  from  the  RTS  smoother.'  The  lower  off-diagonal 
is  given  by 

T21  =  A„  Pn|jy  =  HnPn|tf.  (236) 

Since  the  inverse  of  a  symmetric  matrix  must  also  be  symmetric,  the  cross-covariance 
matrix  is  known  to  be 

Pn,n— 1|AT  =  T12  =  T^j  =  Pn|jvH^.  (237) 

To  verify  the  validity  of  this  expression,  T12  is  evaluated  from  the  terms  in  the  block 
matrix  inverse,  giving  the  expected  result  as 

T12  =  (P^  +  H^A^Hn)"1  hJA^P*.^ 

=  [PnlN-Pn^H^A  +  HnP^H^-'HnP^jH^A;1?^!^ 

=  Pn|NH?  (I  -  P^HnP^)  .  . 

“  Pn|jyH^  [I  —  Pn^.j|iv  (Pn-l|JV  ~  An)  ]  A^1Pn_i|^- 
=  Pn|7vH^  (1  —  I  +  Pn^i|jyAn^  An  1P n-l|JV 
=  P^.vHj. 
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APPENDIX  C 

HGMM  PARAMETER  INVARIANCE  PROPERTY 

The  objective  is  to  show  that  the  model  with  time-invariant  parameters 


.  ©  =  {A,B,  Q,  R,/io,Po}  (238) 

is  equivalent  under  the  measurement  likelihood  function  to. parameter  set 

©  =  {A,  B,  Q,  R,  JIq,  Po}  ,  (239) 

where  the  transformed  parameters  satisfy 

A  =  UaAU^  (240) 

B  =  BUo1U21  (241) 

Q  =  U^UoQUjU*  (242) 

pto  =  U.4  Uo  /^o  .  (243) 

P0  =  U^UoPqUJU^.  (244) 


Here,  U*  can  be  any  nonsingular  Lx  L  matrix  and  U0  can  be  any  nonsingular  Lx  L 
matrix  that  commutes  with  A. 

The  invariance  of  the  likelihood  is  demonstrated  by  showing  the  invariance  of  the 
measurement  innovation  un  and  measurement-error  covariance  S„.  Note  that  the  state 
variables  are  not  invariant  to  the  parameter  transformations  and  that  they  undergo  a 
corresponding  set  of  transformations  defined  by 

Mn|n— 1  =  Uo  Hn\n—i 
Mn|n  =  Uj4  Uo  /^n|n 
P„|n-1  =  U^UoP^-iUJUJ 

P„!»  =  U.,UoPn|,,UjUj. 


First,  it  is  shown  that  output  calculations  with  the  transformed  parameters  and  state 
variables  produce  the  same  measurement  innovation  and  error  covariance  as  the  original 
model.  It  is  then  shown  that  the  Baum  recursions  actually  propagate  these  variables. 
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The  measurement  innovation  is  given  in  terms  of  the  transformed  states  and  model 
parameters  as 

un  —  2n  —  B  Jln\n-l 

~  “BU^U^UxUo/inln-l 

=  2n  B  ^tn|n— 1 

(245) 

and  the  measurement-error  covariance  matrix  is 

Sn  =  R  +  BPn|n-lBT 

=  R  +  Uq  1  U0  Pnjn.jUo  U^U^T  Uq  T  Bt 
=  R  +  BP^B* 

=  (246) 

Thus,  if  it  is  assumed  that  the  transformed  model  propagates  the  transformed  state 
variables  in  the  forward  probability  recursions,  the  invariance  is  demonstrated.  The 
propagation  of  these  variables  is  demonstrated  by  induction.  The  transformed  variables 
fit  the  initial  state  conditions  by  definition,  so  that  only  the  recursions  need  be  examined 
For  the  time  update,  it  is  observed  that 


Mn|n-1  —  An  /Xn-l|n— 1 

=  U^AU^U^Uo/ln-Hn-! 
=  Uj4  A  Uq  i|n_i 
=  Ux  Uo  A  //n_i|n_i 

=  Uj4  Uo  /^n|n— 1- 

The  recursion  for  the  transformed  covariance  matrix  is 


Pn|n-1  —  Q  +  APn_1jn_1AT 

=  U^UoQUjUT  +  U^A U^'U,, Uo P„ Uj Uj  (U^ A U^)T 
-  U^U0  (Q  +  APn.1,n.1AT)u3'Uj 

=  U A  Uo  P n— l|n— 1  Up  Uj.  (247] 

These  expressions  confirm  that  the  time  updates  propagate  the  transformed  variables. 
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Verification  of  the  correct  variable  propagation  during  the  measurement  update  begins 
by  examining  the  gain  matrix,  which  is  given  as 

G„  =  Bt  2„ 

=  UxUoP^UfUj  (BUj1U^1)T  S„ 

=  U^UoPnln-iBS,, 

=  U*UoGn.  (248) 

The  measurement  update  for  the  mean  is  then 

£n|n  =  (l  —  GnB^  jln ,|n_i  +  GnZn 

=  (l  —  Ua  Uo  GnB  Uo 1  U^1)  Ux  Uo  U0  GnZn 

=  UxUo[  (I- GnB)  tL^.l  +  G„z»] 

=  UAUoMn|n,  (249)' 

and  the  update  for  the  covariance  is 

Pn|n  =  GnB)  P n\n— 1 

=  (l-U/1UoG„BUJ1U^1) 

=  U,1Uo[(l-G„B)P„|n-i]UjU]S 

=  UAU0P„|„UjUj.  (250) 

These  expressions  demonstrate  that  the  measurement  updates  propagate  the  trans¬ 
formed  variables,  thereby  completing  the  invariance  proof. 
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